AN EXPERIMENTAL study was conducted at the University of
Minnesota1* to ascertain the effectiveness of the laboratory in
developing the students’ ability to use the scientific method.
The scientific method as an educational objective has had
increasing attention directed toward it. The meaning of the
concept, the scientific method, is still in the process of being
clarified.

*Footnotes will be found at end of article.

Two basic procedures have been utilized by educators for the
definition of the scientific method. One procedure utilizes the
analysis of the problem-solving situation in terms of the type of
mental activities the solver of the problem uses, consciously or
unconsciously. This procedure leads to the identification of
stages in the solution of the problem. An appropriate name for
this meaning of the scientific method is the philosophic meaning.
The philosophic meaning can be derived from an abstract
situation, from the analysis of particular problems, or from the
analysis of the structure of science. The common element,
regardless of which one of the starting points is used, is the
identification of similar aspects of all problems.

This analysis of the problem-solving situation led, originally,
to the formulation of steps for the solution of problems. The
advocates of the formalized step-wise solution of problems are
decreasing in number, apparently. It is certainly worth noting
that the step-wise solution is a highly stereotyped procedure and
is probably not too effective.

The philosophic meaning of the term, the scientific method,
leads to another type of activity which is gaining in importance.
The critical evaluation of current scientific articles is a major
general education objective. Certainly the ability of the students
to evaluate scientific articles in magazines and newspapers should
be one of primary importance. The reader is referred to Dressel
and Mayhew2 for the work done by a cooperating group of
representatives from colleges throughout the United States. A
recent study by Mason and Warrington3 shows a more recent attempt
to evaluate the techniques for developing the ability of the
students to use the scientific method as a tool of analysis.
Another analysis was made by Lawson.4 The work of Priestley,
Lavoisier and Adams was investigated and was found to contain
similar factors. The factors were identified as hypothesis,
expectation and test.

It is not the purpose of this paper to review the current
thinking and research in this area. It is, however, necessary to
show that the means chosen for the teaching of the objective is
related to the manner in which the objective is defined. If the
scientific method (critical thinking) is defined by identifying
aspects of the problem-solving situation, then it logically
follows that the identification of these same aspects in new
problems is the means for attaining the objective. If the purpose
is “....to give the students direct training in individually
analyzing and critically evaluating current scientific articles
and to provide an opportunity for group discussion of their
analyses and evaluations,”5 then the means employed for the
instruction is the analysis of problems. The scientific method,
then, becomes one of analyzing someone else’s solution of a
problem and identifying statements which are observations,
assumptions, hypotheses and conclusions. Although the above
discussion is limited, it should be clear that the scientific
method essentially duplicates the original philosophical
identification of these elements.

The second way in which the scientific method has been defined
is from an historical approach, the research approach. This
approach emphasizes the difficulty of separating phases of the
problem and emphasizes the importance of explicit and implicit
assumptions made by the investigator. Conant6 shows the
difficulty of clearly formulating the direction or method of the
research when he says, “The stumbling way in which even the
ablest of the early scientists had to fight through thickets of
erroneous observations, misleading generalizations, inadequate
formulations and unconscious prejudice is the story which it
seems to me needs telling.”

For the purpose of this study where the lab- 
oratory is used as the means for developing the 
student's ability to use the scientific method the 
second definition appears to be the more appli- 
cable. If laboratory problems are to be used, 
the student should be placed in a creative situa- 
tion rather than in a situation in which critical 
analysis of a previous solution is required. 
George7 reflects this opinion when he writes,
“Philosophy and logic alike are critical rather
than creative. Whilst they may provide wisdom
after the event, they give little help in creativeness.
The more a piece of research is creative
the more we physicists like it. Now, whatever
else it may be, scientific method is the technique
of research. Scientific method is the tactics
and strategy of scientific research.”

As the historical or research definition of 
the scientific method was chosen to be the ap- 
plicable one for a laboratory situation, the teach- 
ing procedures and the tests used to measure 
growth must reflect this meaning. Therefore, 
the written tests were constructed to measure 
the ability of the students to design an experiment
to collect the data and to measure the ability
of the students to interpret the data. A performance
test was constructed to measure the
ability of the students to solve a problem. The
problem was to be solved by using the simple 
equipment which was placed before the students. 

The Design an Experiment Test required the 
students to write an explanation of how they would 
proceed to collect data. The procedure they 
employed should give data from which the con- 
clusion could be drawn. The procedure, which 
the students would use, utilized the materials 
given for the experiment. The students in their 
explanations were expected to show critical-mindedness,
the use of controls and the other
safeguards of experimentation. This type of
question, of course, did not measure the pro- 
cess of reasoning or the step by step aspects of 
problem-solving. It measured the ability of the 
students to arrive at a correct way of solving 
the problem with the equipment which was given. 
Therefore, the Design an Experiment questions 
were constructed to measure the ability of the 
student to reorganize his knowledge and formulate
a plan for solving the problem. We contend
that this ability differs from the ability to
analyze given choices and choose the best one.

As the laboratory was used for the experimentation,
problems had to be chosen which
could be answered from the data of an experi- 
ment. This type of problem has been called an 
inductive problem. The inductive problem leads 
to the establishment of a scientific law if, within
the limits set by experimentation, contradictory
results have not been obtained. Parenthetically,
when the establishment of a scientific law is
analyzed in terms of the philosophic meaning of the
scientific method, the terms hypothesis and theory
denote the probability that the generalization
is true. If, historically, the experiment is
verified many times but has never been contra- 
dicted within the limits of the experiment, the 
hypothesis and then the theory, by induction, 
gain the status of a scientific law. Boyle’s Law 
and Charles’ Law are examples of this type of 
conclusion. Perhaps a better term than the inductive
method for this part of the scientific
method would be the empirical method. At least,
among the philosophers of science (see Feigl
and Sellars8), the term empirical law is the more
common term.

In the physical sciences there is another im- 
portant facet of scientific methodology. The sci- 
entific method includes this facet, too. A higher 
order theory is derived analytically from a set 
of assumptions—the kinetic theory, for example 
—and not from data. The kinetic theory cannot 
be logically deduced from Boyle’s Law, Charles’ 
Law, Graham’s Law, specific heats and other 
characteristics of gases. The kinetic theory is
a mathematico-logical deduction from a set of 
assumptions concerning molecules. The laws 
describing the behavior of gases can be deduced 
from the theory. The theory is verified by de- 
ducing the predicted behavior of the gases and 
by testing these predictions in the laboratory. 
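
As a brief illustration of such a deduction (a standard textbook sketch, not a derivation taken from the study), assume N point molecules of mass m in a volume V with mean square speed \overline{v^2}, and assume further that the mean kinetic energy of a molecule is proportional to the absolute temperature T, with k denoting Boltzmann's constant. All of these symbols are introduced here for the illustration only.

```latex
\begin{align*}
PV &= \tfrac{1}{3}\, N m\, \overline{v^{2}}
   && \text{(pressure from molecular impacts)} \\
\tfrac{1}{2}\, m\, \overline{v^{2}} &= \tfrac{3}{2}\, k T
   && \text{(assumed link between energy and temperature)} \\
PV &= N k T
   && \text{(substituting the second line into the first)}
\end{align*}
```

At constant temperature the product PV is constant, which is Boyle's Law; at constant pressure V is proportional to T, which is Charles' Law. The laws follow from the assumptions, but the assumptions cannot be recovered from the laws.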

If the experiment refutes the theory, the theory is not
rejected or discarded as an empirical law would be.
The theory is modified, if possible, to account 
for the discrepancy between the prediction and 
the experiment. The discovery that the specif- 
ic heats of certain gases, for example oxygen, 
were not predicted by assuming small spherical 
molecules did not result in the overthrow of the 
kinetic theory. Instead, the assumption was 
considered wrong and the subsequent derivation 
of a higher order theory explained the discrepancy.
The derivation of a theory from a set of
assumptions is called the hypothetico-deductive
theory by Feigl.9 This phase of scientific methodology
cannot be included in the inductive or
empirical laboratory for general education stu- 
dents. It is important, however, for the spec- 
ialist in physics or chemistry and is, essenti- 
ally, the reason for the difference in meaning 
attached to the concept—the scientific method— 
by the specialist. It is clearly related to the 
understanding of the higher order theories. 

After the analysis of the meaning of the scientific
method and its application to the laboratory
situation, it can be perceived, readily, that
the problem-solving situations and the tests must 
conform to this meaning. The problem, it was 
noted, must be of an empirical type and must be 
soluble with relatively simple equipment. The 
tests which are utilized must be of a lower type 
than the hypothetico-deductive kind. The induc- 
tive-deductive situations meet the criteria in 
both cases. 


The Teaching Situation, Population, and 
Teaching Methods 








The experiment was conducted in a natural 
science class at the University of Minnesota.10
The natural science course is one of the gener- 
al education courses in the General Studies De- 
partment. The students for whom the courses 
are designed are non-science majors. Profes- 
sor Graubard was in charge of the course. The 
course is a three quarter sequence, carrying 5 
credits each quarter. The class meets five 
times a week for a lecture-demonstration. A 
one-hour laboratory is required of all the stu- 
dents enrolled in the course. The experiment 
was concerned with the utilization of this labor- 
atory in teaching for the general education ob- 
jective—the scientific method. The content of 
the laboratory, however, had to be adapted to 
the total course program. 

The total number of students enrolled in the 
course was three hundred thirty-eight, 48 per- 
cent of whom were boys and 52 percent girls. 
Of the total students enrolled in the class, 53 
percent were freshmen, 33 percent sophomores, 
10 percent juniors, and four percent seniors. 
The largest percentage of the students were enrolled
in the College of Science, Literature and
Arts (75 percent), with the remaining students
in Education (24 percent) and the General College
(1 percent).

In attempting to establish the population from 
which the experimental group is a sample, a
comparison was made with a subsequent class. 
The class which was compared with the experi- 
mental group had a total enrollment of three 
hundred ninety-five students. The composition 
of the subsequent class was analyzed and was 
found to contain 46 percent boys and 54 percent 
girls. Fifty-seven percent of the class were 
freshmen, 34 percent were sophomores, 5 per- 
cent were juniors, and 4 percent were seniors. 
The largest percent of the students were in the 
College of Science, Literature and Arts (79 per- 
cent), with the remaining students in Education 
(21 percent), and a miscellaneous group consisting
of I. T., General College, Adult Special
and Agriculture (2 percent). On the basis of 
external characteristics the experimental class 
resembled the subsequent class very closely.
Because of the close similarity in the distribution
of students, the experimental sample can
be regarded as a random sample of the students
enrolled at the University of Minnesota in the
natural science class.

One other characteristic was investigated. 
After choosing the first student at random, a sys- 
tematic sample of fifty-eight students was ob- 
tained from the comparative class. The A.C.E. 
Psychological Examination scores were obtained 
from the Counseling Bureau. These scores were
compared with those of a sample of sixty students
from the experimental group. The sample of
students from the
experimental group was obtained in the same 
manner as the sample from the comparative 
class. An analysis of the scores showed that 
the mean of the experimental sample (112.85)
was not significantly different from the mean of
the comparative sample (109.64). The t-test is
the appropriate test of the significance of the
difference between the means of two random samples. The
t-test can be used if the variances of the two
samples are homogeneous, that is, if the variances
are not significantly different. The standard
deviation (the square root of the variance) of the
experimental sample was 18.45 and the standard
deviation of the comparative sample was 16.09.
The variances were shown to be homogeneous at the
5 percent level by means of the L1-test.
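
The comparison can be retraced from the summary figures alone. The following is a minimal sketch, not the computation actually used in the study: it assumes that the reported standard deviations were computed with the n - 1 divisor, and the function name `pooled_t` is hypothetical. The variance ratio is printed only as a rough check; it is not the L1-test named in the text.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic using a pooled estimate of the variance."""
    df = n1 + n2 - 2
    # Each sample variance is weighted by its own degrees of freedom.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, df


# Summary figures as reported in the text (experimental, then comparative).
t_stat, df = pooled_t(112.85, 18.45, 60, 109.64, 16.09, 58)

# Rough check on homogeneity: larger variance divided by smaller variance.
variance_ratio = (18.45 / 16.09) ** 2

print(f"t = {t_stat:.2f} on {df} degrees of freedom")
print(f"variance ratio = {variance_ratio:.2f}")
```

Under these assumptions t comes out close to 1.0 on 116 degrees of freedom, well below the tabled two-sided 5 percent value of about 1.98, which agrees with the conclusion that the two means do not differ significantly.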

Since the study was conducted in the labora- 
tory the effect of the lectures was assumed to 
be the same for each experimental group. The 
analysis of variance of the final lecture examin- 
ation score indicated that there was no signifi- 
cant difference in means among the experiment- 
al groups. The students of all the sections in 
Study One had the same laboratory instructor. 
It was assumed that this did not influence the re- 
sults. 

The background of the students was investigated.
A 2 x 2 chi-square classification table
was formed from the experimental methods and
from the number of units of high school mathematics
and science. The test of significance
shows that the two classifications were independent.
There was no significant difference
among the experimental groups due to their background
in science and mathematics.
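
The cell counts of this table are not reported, so the test cannot be reproduced here. The sketch below only shows the form of the computation for a 2 x 2 table; the counts are hypothetical placeholders chosen for illustration, the function name `chi_square_2x2` is an assumption, and no continuity correction is applied.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2 x 2 table [[a, b], [c, d]], uncorrected."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# HYPOTHETICAL counts, not data from the study:
# rows = two groupings of methods, columns = fewer / more high school units.
hypothetical_counts = (22, 20, 21, 23)
chi2 = chi_square_2x2(*hypothetical_counts)

# Tabled 5 percent point of chi-square with 1 degree of freedom.
CRITICAL_5_PERCENT_DF1 = 3.841
classifications_independent = chi2 < CRITICAL_5_PERCENT_DF1

print(f"chi-square = {chi2:.3f}, independent at 5 percent: {classifications_independent}")
```

A value below 3.841 corresponds to the finding reported above, namely that the classification by method and the classification by background are independent.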

The four methods of teaching which were compared
in the experiment were the inductive-deductive,
historical, theme and standard methods.
The discussion method was substituted for
the theme method in the second study. 

The inductive-deductive method was defined 
as the problem-solving method. The inductive 
method is commonly defined in educational literature
as proceeding from the observed individual
events to the generalization. The deductive
aspect of the scientific method in a problem-solving
situation requires the application of
a generalization which was “discovered” in the
laboratory to another problem. For the purposes
of this study the inductive aspect of the
scientific method was defined as proceeding 
from the individual events, which were observ- 
able in the laboratory, to a generalization. In 
a laboratory situation the generalization will 
usually be a law. The scientific law describes
the relationship between two variables, for example,
Newton’s Second Law. Using this definition,
Newton’s First Law is not a law. The inductive
method, when it refers to the formulation
of a law, should be called, more properly,
the empirical method. The laboratory is not 
the proper place to handle the inductive method 
in terms of theory construction. The student 
can deduce from a hypothetico-deductive theory 
to a new problem, however. The problem-solv- 
ing method defined in this way places a premium 
upon the knowledge of facts and principles of sci- 
ence and the ability to reorganize them to suit 
the problem. 

The laboratory, using this method, was con- 
ducted in small groups or on an individual basis. 
The problem was posed by the instructor. The 
individual student, or at most two or three stu- 
dents together, were left to develop a means of 
attacking the problem and analyzing the results. 
The instructor moved from one individual to another
and aided each by asking leading questions
concerning the problem. This problem-solving 
approach differs from the others which are re- 
ported in the literature by having the individual 
students attack the problem. Each individual or 
group formulates the means by which the prob- 
lem is attacked instead of having the entire class 
decide upon a particular method and then each 
group performing the experiment. Having each
individual or group design its own experiment
would tend to carry each student as far beyond
his present ability in problem-solving of this kind
as possible. The number of questions and the
simplicity of the questions asked of each individ- 
ual depended upon the insight of the student. If 
the student perceived the implications of the ques- 
tion, the problem was clarified rapidly. If the 
student did not perceive the implications he had 
to be led to the design of the experiment by a 
more complete sequence of questions. Considerable
variation among the individuals should be
expected. “The problem-solving approach was
operationally defined as the ability to design,
rudimentarily at least, an experiment to collect
data, and then to analyze and interpret the data
collected in order to arrive at an acceptable conclusion
which is consistent with the data.”11

The historical method was an adaptation of 
Conant’s Case History Method to the laboratory 
situation. Conant’s Case Studies are concerned 
with the development and clarification of a conceptual
scheme. Those topics or conceptual
schemes which require a great deal of modification
from the old to the new cannot be handled
adequately in the laboratory situation alone. The
historical method, as adapted for this study, was 
not a problem-solving approach. The emphasis 
was placed upon an understanding of the difficulty 
of isolating the factors which are important in 
the experiment. Where possible, the original 
reading material, or a translation of the origin- 
al, was assigned to be read previous to attend- 
ing the laboratory class. The writing was anal- 
yzed for critical reasoning employed by the writ- 
er of the material. Analysis of the material 
was patterned after Dr. Graubard’s technique 
for teaching the history of science. The readings
were analyzed in terms of the critical attitude with
which the experimenter attacked the problem.
What things did he question? What did he accept 
as true that other scientists of the period be- 
lieved? What assumptions did he make? What 
experiments did he perform and how was he led 
to the experiment? The experiment was then
performed and interpreted. The factors necessary
for experimentation, such as controls, significant
figures, accuracy and openmindedness,
were emphasized while the conclusion was be- 
ing made. A summary of the experiment was 
handed in. In many cases these reports would 
have the appearance of a problem-solving sum- 
mary. It was not problem-solving, however, be- 
cause it tended to verify the conclusion which 
was already known through the readings or the 
discussion. The analysis of a problem-solving 
situation for aspects of the scientific method 
may be interpreted as the philosophic meaning 
of the scientific method. 

The standard type laboratory was defined as 
the usual, descriptive type. The instructor de- 
scribed the experiment in detail before the stu- 
dents performed the experiment. Considerable 
attention was given to the difficulties the stu- 
dents might have in performing the experiment 
and in interpreting the results. The value of 
controls, and of the other aspects of experiment- 
ation, was emphasized before the experiment 
was performed. This approach differed from 
the manner in which these matters were handled in
the problem-solving laboratory. In the problem-solving
laboratory the questions of controls, significant
figures, etc., were introduced when the student
attempted to make an interpretation of the data.
The student at this time would see the need for 
a control and could be led to use a control by a 
leading question. As the students in the stand- 
ard laboratory were told what to do and what to 
observe, the experimentation was of the typical 
“cookbook” variety.

The theme method was used in Study One. 
Overbeck12 used the energy theme for a natural
science class which he taught at Northwestern.
Because the utilization of energy is the basis of
our civilization, it was chosen as a represent- 
ative and important unifying theme. Although 
the energy theme is of value in interpreting sci- 
ence in our civilization, it was too difficult to 
adapt to the laboratory situation alone. There- 
fore, the theme method was unsatisfactory and 
in many cases became another standard labor- 
atory. 

In Study Two the theme method was replaced 
by the discussion-recitation method. In the dis- 
cussion-recitation method the students did not 
perform any experiments. The historical as- 
pects of the problem were discussed. The at- 
tention of the students was directed toward the 
analysis of the historical problem in terms of 
the assumptions which were made by the exper- 
imenter, the alternate hypotheses which could 
be experimentally tested, and the interpretation 
of the results. While interpreting the results 
of the experiment, the characteristics of a 
straight line graph and the importance of con- 
trols, significant figures, etc., were empha- 
sized. That is, the historical problem was an- 
alyzed for the characteristics of the scientific 
method. The discussion was extended to in- 
clude examples of the concept that the class 
had experienced. The social implications of 
the results were discussed. Any aspect of the 
problem was discussed which was brought up 
by the students. The discussion, after the an- 
alysis of the problem itself, followed the ques- 
tions and statements of the students and was not, 
therefore, a repeatable class. That is, the 
methodology of one discussion-recitation group
could not be exactly duplicated in the others. 


The Design of the Experiment 





Natural Science I and II, the reader will re- 
call, are large classes. The student does not 
register for a particular laboratory hour when 
he registers for the course. The first day the 
student attends the class he fills out a schedule 
which includes the hours of all his classes and 
his work schedule. From an inspection of the 
class schedules the students are assigned to a 
laboratory section. From previous experience 
with the laboratory sections it was known that 
students would prefer, and in many cases find 
it mandatory, to have their laboratory section 
during one of a few choices. The student who 
worked in the afternoon and had a three hour 
course in the morning would want a laboratory
class in the mornings on the days the three hour
course did not meet. If the student had a ten
o'clock class which met Monday, Wednesday 
and Friday, he would want a laboratory which 
met on Tuesday or Thursday at ten o’clock. A 
sizeable number of students went to work immediately
after their last class. A group of
students who were not employed and had other
classes in the afternoon would rather have a laboratory
section in the afternoon.

The design of the experiment must consider 
the characteristics of the students which were 
cited above, as well as satisfy the requirements 
of a self-contained experiment. Johnson states 
that, “The purpose....of making an experiment
self-contained is to make possible the valid and
unequivocal interpretation of its results without
referring for decision or settlement or consideration
to other experiments or to the aggregate
or experience of prior collection.”13

There are three requirements to be satisfied 
by a self-contained experiment.14 The first of
these requirements is randomization. Randomization
is essential in statistical experimentation
for the tests of significance to be valid and
for the estimates of treatment effects to be un- 
biased. Randomization is essential to insure 
against biases, known or unknown, which may 
introduce a systematic error. This is accom- 
plished by assuring that whatever source of er- 
ror may affect the experimental results, also, 
with equal probability, affects the estimate of 
error. 

The second requirement of a self-contained 
experiment is replication. The precision of the
experiment depends upon the replication. Rep- 
lication provides the only means of estimating 
the experimental error. This experimental er- 
ror decreases in size as the number of replica- 
tions increases, providing, of course, that 
there is no increase in the heterogeneity of the 
experimental groups or that there is no greater 
carelessness in the use of techniques. 
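
The dependence of precision upon replication can be made explicit with the usual expression for the standard error of a treatment mean. It is offered here as a standard result and not as a formula quoted from the study; \sigma denotes the standard deviation of the experimental error and r the number of replications, both symbols being introduced for this illustration.

```latex
\[
  \mathrm{SE}(\bar{y}) \;=\; \frac{\sigma}{\sqrt{r}}
\]
```

Under this expression, going from one replication to four halves the standard error, but only so long as \sigma itself does not grow through greater heterogeneity of the groups or careless technique.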

The third requirement of a self-contained ex- 
periment is a control or controls. The control 
allows the comparison of the experimental 
groups. The control may be another experimental
group. All the treatments directly compared,
including the control, if specified, must
be compared upon the same experimental mater- 
ial. It is by local control, e.g., randomized 
blocks, that replication leads to a reduction of 
experimental error. 

A design which meets the requirements of a 
self-contained experiment and which meets the 
requirements of the situation is the incomplete 
block design. The incomplete block design can 
be used in two situations. If the number of ex- 
perimental groups in which randomization can 
be achieved is less than the number of treat- 
ments which are to be compared, the incomplete
block is an appropriate design.

In this case the incomplete block design is 
used to assure randomization of the individuals. 
As the reader will recall, the students involved
in the study tend to have the same hour free on
Tuesday and Thursday or on Monday, Wednesday
and Friday. The minimum number of days
at the same hour within which randomization
can be achieved is two. A complete replica- 
tion of four treatments could not have been de- 
signed in which all the comparisons would be 
made with equal precision. Another consider- 
ation would vitiate the use of a randomized 
block experiment. The students could take lab- 
oratory during particular times of the day only. 
Hence, the replications would be selective. 
Therefore, systematic error or bias could 
arise which would affect the experimental re- 
sults. The principle of randomization would 
be violated. Therefore, the incomplete block 
design was chosen to assure randomization in 
the experiment. 

The incomplete block is also useful when 
the homogeneity of the experimental groups 
varies considerably. The number of treat- 
ments may be larger than the number of homo- 
geneous groups that are available. For exam- 
ple, at Western Washington College of Educa- 
tion the freshmen students are divided into 
three groups on the basis of the entrance examination
in English. The sequence of classes
for the three groups varies. Any experimentation
which would involve these freshmen students
introduces the factor of selection by
means of the entrance examination in English. 
The incomplete block design would be an appro- 
priate design to control the selection of the stu- 
dents, and to control the homogeneity of the 
experimental groups. Besides controlling the 
homogeneity of the experimental groups, the 
incomplete block design gives a means of esti- 
mating the effect of the grouping. 

Because this experiment may be the first in 
educational research which utilizes the incom- 
plete block design, the author would like to dis- 
cuss the characteristics of the design in more 
detail. The designs are arranged in blocks,
each of which holds fewer treatments than the
total number of treatments to be compared.
Table I shows the incomplete block design arranged
in separate replicates. Replication I
is composed of students who could attend a lab- 
oratory scheduled at 8 o'clock. These students 
would tend to have Monday and Wednesday or 
Tuesday and Thursday free at 8 o'clock. There- 
fore, Monday and Wednesday became one block 
of the replication and Tuesday and Thursday the 
other block. If there were insufficient students 
available to fill the blocks, those students who 
were available on any day of the week at 8 o’clock
were picked randomly to fill the classes.
The other replications were chosen and com- 
pleted in a similar manner. 

After the students were assigned to their
block and replication, further randomization
was required.15 As the design was arranged
in complete replications, the blocks were ran- 
domized within each replication. That is, the 
randomization determined whether Block 1
would be composed of the Monday-Wednesday
classes or of the Tuesday-Thursday classes. 
The treatment numbers within each block were 
randomized. The treatments were assigned 
randomly to the numbers. The students within 
each block were then randomly assigned to the 
experimental classes or methods. The samp- 
ling procedures that were used should fulfill the 
second criterion for randomness. This criter- 
ion requires the sampling to be independent of 
the treatment being measured. 
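
The successive randomizations described above can be sketched in a few lines. This is an illustration under stated assumptions and not the procedure actually carried out: the layout in `DESIGN` is an assumed arrangement in which every pair of treatments shares a block exactly once, the student identifiers are hypothetical, and the order of the two treatments within a block is left as listed, since it does not affect which students receive which method.

```python
import random

# Assumed layout: three replications, each with two blocks of two treatments.
DESIGN = {
    "I": [(1, 2), (3, 4)],
    "II": [(1, 3), (2, 4)],
    "III": [(1, 4), (2, 3)],
}
METHODS = ["inductive-deductive", "historical", "theme", "standard"]


def randomize(design, methods, students_by_slot, rng):
    """Return {(replication, day_slot, method): [students]}."""
    # The methods are assigned to the treatment numbers at random.
    shuffled = list(methods)
    rng.shuffle(shuffled)
    method_of = dict(zip((1, 2, 3, 4), shuffled))

    plan = {}
    for rep, blocks in design.items():
        # Chance decides which day slot becomes which block of the replication.
        slots = list(students_by_slot[rep])
        rng.shuffle(slots)
        for slot, block in zip(slots, blocks):
            # Students within the block are split at random between its two classes.
            students = list(students_by_slot[rep][slot])
            rng.shuffle(students)
            half = len(students) // 2
            groups = (students[:half], students[half:])
            for treatment, group in zip(block, groups):
                plan[(rep, slot, method_of[treatment])] = group
    return plan


# Hypothetical rosters: eight students per day slot in each replication.
students_by_slot = {
    rep: {
        "Mon-Wed": [f"{rep}-MW-{k}" for k in range(8)],
        "Tue-Thu": [f"{rep}-TT-{k}" for k in range(8)],
    }
    for rep in DESIGN
}
rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
plan = randomize(DESIGN, METHODS, students_by_slot, rng)
```

Because every assignment is made by the random number generator and none by the content of the methods, the sampling is independent of the treatment being measured, which is the criterion stated above.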

The inductive-deductive method is compared
with the historical method in Replication I, where
the two are in the same experimental block. The
students in this block were randomly assigned to
the methods and the means of the treatments are
directly comparable. In Replications II and III,
however, the inductive-deductive method and 
the historical method are in different blocks. 
Therefore, the average means of the treatments 
are not directly comparable. The differences 
between the blocks in Replications II and III are 
mixed up, that is, confounded with the treat- 
ments. This is a unique feature of the incom- 
plete block design. The treatments must be ad- 
justed for the block effects. The adjustment 
may be made upon each observational score or 
it may be made upon the average score within 
the block. The two different adjustments lead 
to two different solutions of the statistical prob- 
lem. The difference between the two solutions 
will be clarified if we examine the mathematical 
model. 

The mathematical model for a single observational
value can be expressed, “....as the
sum of four components: (i) a general average
about which the observations are presumed to
be fluctuating; (ii) a component representing the
effect of the treatment applied; (iii) a component
representing certain environmental effects
which the design of the experiment enables us
to isolate; and (iv) a residual component, representing
all the other sources including errors
of measurement that influence the observation,
and generally referred to as ‘experimental
error’.”16 That is, the score of an individual,
$Y_{ij}$, on a test is composed of a mean effect,
the general mean of all individuals in the exper- 
iment; a component due to differences in treat- 
ments; an effect due to the experimental block; 
and an experimental error. It is assumed that 
the four factors are additive. The mathemati- 
cal model can then be written: 


$$Y_{ij} = n_{ij}\,(\mu + \tau_i + \beta_j + e_{ij})$$


In this experiment $i$ indexes the treatments and takes the values 1,
2, 3, or 4. The index of the experimental blocks, $j$, takes the values
1, 2, 3, 4, 5, or 6. The value of $n_{ij}$ is equal to 1 or 0. It is 1
if the treatment is in that particular block, and it is 0 if the
treatment is not in the block.
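
The indicator $n_{ij}$ can be made concrete with a short sketch. The block layout below is taken from Table I; the variable and function names are illustrative assumptions and are not part of the original analysis.

```python
# Treatments (i = 1..4) and blocks (j = 1..6) as laid out in Table I.
TREATMENTS = ["inductive-deductive", "historical", "theme", "standard"]

# Each block holds exactly two treatments (two experimental classes).
BLOCKS = {
    1: ("inductive-deductive", "historical"),
    2: ("theme", "standard"),
    3: ("inductive-deductive", "theme"),
    4: ("historical", "standard"),
    5: ("inductive-deductive", "standard"),
    6: ("historical", "theme"),
}


def incidence_matrix(treatments, blocks):
    """Return n[i][j] = 1 if treatment i occurs in block j, else 0."""
    return [
        [1 if t in blocks[j] else 0 for j in sorted(blocks)]
        for t in treatments
    ]


n = incidence_matrix(TREATMENTS, BLOCKS)
for name, row in zip(TREATMENTS, n):
    print(f"{name:20s} {row}")
```

Each row of the matrix contains three ones (a treatment appears once per replication) and each column contains two (a block holds two classes), which is why block differences are confounded with some treatment comparisons.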

The statistical solution of the problem depends upon the assumptions
made about the four components in the mathematical model. If we assume
that the mean, the treatment and the block effects are averages and
that the experimental error is normally and independently distributed
with a mean of zero and with a variance of $\sigma_e^2$, the
statistical solution utilizes the "intra-block" information. The
experimental error in this case is called the intra-block error. If no
assumption is made concerning the variability of the block effect, the
block effect is fixed and is the same for each individual. Therefore,
the observed variation within the block must be due to experimental
error. It is this variation which is called the intra-block error.

The mathematical model equates the observed value, $Y_{ij}$, to the
expected total. The least squares solution of this mathematical
equation leads to an expectation equation for each of the various
components. The equations for treatment effect and block effect show
that the adjustment which was mentioned above can be made by
subtracting from each treatment total one-half the sum of the totals of
all the blocks in which the treatment occurs. For example, the
treatment total for the first experimental method would be obtained by
adding the test scores of each individual in all three replicates. The
total sum of these test scores will contain the effect due to the
blocks. The block effect can be removed from the treatment effect by
subtracting one-half of the total sum of the test scores within the
blocks in which the treatment occurs. The one-half represents the
amount each experimental class contributes to the block total. There
are two experimental classes in one block. Therefore, one-half of the
block total is due to each one of the experimental classes.
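
The adjustment described above can be sketched in a few lines. This is a minimal illustration, not the original computation: the class totals are hypothetical, noise-free values built from assumed effects, and the function name is an assumption. The block layout follows Table I (ID = inductive-deductive, H = historical, T = theme, S = standard).

```python
from collections import defaultdict

# Block layout from Table I: block number -> the two treatments it contains.
BLOCKS = {1: ("ID", "H"), 2: ("T", "S"), 3: ("ID", "T"),
          4: ("H", "S"), 5: ("ID", "S"), 6: ("H", "T")}


def adjusted_treatment_totals(class_totals, blocks=BLOCKS, k=2):
    """Q_i = T_i - (1/k) * (sum of the totals of the blocks containing i).

    class_totals maps (block, treatment) -> total score of that class.
    k is the number of classes per block (two in this design).
    """
    block_total = defaultdict(float)
    treat_total = defaultdict(float)
    for (block, treat), score in class_totals.items():
        block_total[block] += score
        treat_total[treat] += score
    adjusted = {}
    for treat in treat_total:
        containing = [b for b, members in blocks.items() if treat in members]
        adjusted[treat] = (treat_total[treat]
                           - sum(block_total[b] for b in containing) / k)
    return adjusted


# Hypothetical, noise-free class totals built from assumed effects.
mu = 50.0
tau = {"ID": 6.0, "H": 1.0, "T": -4.0, "S": -3.0}             # assumed
beta = {1: 5.0, 2: -2.0, 3: 0.0, 4: 3.0, 5: -5.0, 6: -1.0}    # assumed
totals = {(b, t): mu + tau[t] + beta[b]
          for b, pair in BLOCKS.items() for t in pair}

Q = adjusted_treatment_totals(totals)
# For this layout (4 treatments, blocks of 2, every pair together once),
# Q_i / 2 estimates the treatment effect measured from the treatment average.
effects = {t: q / 2 for t, q in Q.items()}
print(effects)
```

Because the assumed treatment effects sum to zero and no error term is added, the adjusted totals recover the assumed effects exactly even though the block effects differ; with real scores the same arithmetic yields the intra-block estimates.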

The analysis may also be made by utilizing the variation of the
observational scores among the blocks. That is, rather than assuming a
fixed block effect, the block effect is assumed to vary. The
experimental error in this case is called the inter-block error. If the
block effects vary, additional assumptions must be made. "The
additional assumptions are made that the block effects are normally and
independently distributed with zero means and variances $\sigma_b^2$
and that they are independent of the error variances."17 The validity
of the assumptions must be determined by experimentation, and they may
be found to be inapplicable to certain kinds of data. The above
assumption implies that the observations in a single class are
positively correlated, and the analysis is changed to account for the
correlation.





The estimate of the experimental error for two observations within a
single block is the intra-block error discussed before. The estimate of
the experimental error for two observations in different blocks must
include the block variation. That is, if the blocks are assumed to be
randomly selected, the two observations in different blocks contain two
different estimates of the block variance, $\sigma_b^2$, as well as two
different estimates of the intra-block error, $\sigma_e^2$. Therefore,
"The basic change in the model is that the $\beta$'s are assumed to be
random effects with the same variance, $\sigma_b^2$, the plot-to-plot
(intra-block) error being designated by $\sigma_e^2$."18

We have two independent estimates of the treatment effect. As each one
of the two estimates has a different error variance, a weighted average
of the two estimates must be found. The maximum likelihood method is
applied to the mathematical model to obtain the equations for the two
weights. The weights of the two estimates are not known. Yates19 has
shown that they can be estimated from the data of the experiment. The
test of significance for the null hypothesis does not take into
consideration the variation of the weights due to sampling. When the
means of the treatments are adjusted for block effects, the weights are
considered to be constant. For large experiments the weights should be
fairly constant. For small experiments the sampling variation in the
weights may become excessively large. Therefore, "If there are fewer
than 15 degrees of freedom..., the authors do not recommend using this
weighted analysis..."20 Since this experiment contains only 5 degrees
of freedom for among-blocks variation, the intra-block analysis was
used.


The Analysis of the Experimental Results 


The design of the experiment and the sampling process have been
discussed. It is the design of the experiment and the randomization
procedure which determine the statistical techniques used for the
analysis of the results. The incomplete block analysis was made using
the intra-block estimate of experimental error. The design, the reader
will recall, utilizes the total score of the experimental unit and the
total score of the block. We speak of the experimental unit as
containing several observational units: the class of students receiving
a certain treatment is the experimental unit, while the individual
students are the observational units.

The experimental units were used to investigate the main purposes of
the study, which were: (1) to design or use laboratory experiments
which would reflect the meanings of the scientific method; (2) to
design instruments to measure the ability of the students to use the
scientific method; (3) to evaluate the effectiveness of the teaching
methods. Several specific questions were involved in evaluating the
effectiveness of the teaching methods. These questions were re-phrased
to form the following hypotheses:


1. The use of laboratory experiments of a problem-solving nature
   does not lead to a greater resourcefulness in solving physical
   science problems which are new to the student.

2. The use of laboratory experiments of a problem-solving nature
   does not lead to a greater resourcefulness in designing an
   experiment to find the answer to a scientific problem.

3. The use of laboratory experiments of a problem-solving nature
   does not lead to a greater resourcefulness in interpreting the
   results of an experiment.

4. The use of laboratory experiments of a problem-solving nature
   does not lead to a greater utilization of facts and principles
   in the solution of a problem.

5. The work schedules and the time of the day the students prefer
   a laboratory do not affect the mean scores of the students.


The analysis of the results of the study was concerned with testing the
significance of each of the null hypotheses. Each hypothesis states
that the means of the methods are all equal. The null hypothesis is
phrased in this way to allow the refutation of the hypothesis if the
evidence permits. The burden of the proof is on the data.

The observational data indicated that the mean scores of the
inductive-deductive method were higher on all parts of the tests in
Study One. On the Interpretation of Data Test, the mean scores of the
historical and standard methods were a close second to the
inductive-deductive method. The mean score of the students in the theme
method was slightly lower than the mean scores of the historical and
standard methods.

In the Design an Experiment Test, the mean scores of the
inductive-deductive method appeared to be considerably higher than the
mean scores of the other three methods. The mean scores of the Total
Written Test would, of course, reflect this same difference between the
inductive-deductive method and the mean scores of the other three
methods.

On the Performance Test the mean score of 
the inductive-deductive method was higher than 
the mean score of the standard method. The 
mean score of the standard method was higher 
than the mean score of the historical method by 
as much as the mean score of the historical meth- 
od was higher than the mean score of the theme 
method. 

All of the above tests were found to be reli- 
able at the 1 percent level of significance or less. 
The mean scores of the A.C.E. Psychological
Examination in Study One and Study Two showed 
considerable variation. The covariance tech- 
nique, where it is applicable, will equalize the 
inequalities due to general ability as measured 
by the test. Therefore, these differences are 
not crucial. 

The same apparent pattern was noticeable in 
Study Two with one exception. The mean score 
of the discussion method of the Performance 
Test was considerably below the mean scores of 
the other methods. One might conclude that this 
was because the students did not perform any ex- 
periments or handle equipment in the laboratory. 
The application of the t-test showed that signifi- 
cant growth occurred in the ability to solve prob- 
lems with the equipment despite the handicap of 
not performing experiments. The t-test showed 
that learning occurred for each of the methods. 
The means of the inductive-deductive method and 
of the historical method apparently increased 
more than for the other two methods. The anal- 
ysis of the experimental results will determine 
whether any of the apparent differences cited 
can be explained on the basis of chance variation. 
If a significant difference is found among the 
means of the methods, presumably, that differ- 
ence will be due to the different teaching meth- 
ods. 

All of the analyses cannot be reviewed in this 
summary. Therefore, the following examples 
are used to show the techniques employed. Be- 
fore the analysis of data can be undertaken, the 
assumptions made in the mathematical model 
must be tested. The observational data were 
found to agree with the assumptions underlying
the mathematical model in most cases. What- 
ever conclusions were drawn from the studies 
were drawn from analyses which could logically 
be made. 

The incomplete block analysis of the Design an Experiment Examination
was made to test the significance of the difference in means of scores
among the methods. The incomplete block analysis of variance is shown
in Table II. The F-ratio of 7.91, which is an exact test for the
intra-block error analysis, is less than the 5 percent tabled value of
9.28. The null hypothesis is accepted on the basis of the F-ratio. The
analysis of variance revealed no significant difference in means among
the teaching methods. The observed difference in favor of the
inductive-deductive method could be explained by chance variation.
Therefore, the ability of the students to design experiments was not
significantly influenced by the different teaching methods. The F-ratio
of 7.91 does approach the level of significance and falls in the 5-10
percent probability range. In only 5 to 10 times out of 100 in the long
run will such a difference appear by chance. The educational conclusion
is that the teaching methods did not produce significantly different
mean effects as measured by the Design an Experiment Test. It is
apparent that, although no statistically significant results were
obtained, the data for the groups vary considerably and tend to favor
the inductive-deductive method.


TABLE I
THE INCOMPLETE BLOCK DESIGN

Replication I          Replication II         Replication III

Block 1                Block 3                Block 5
Inductive-deductive    Inductive-deductive    Inductive-deductive
vs                     vs                     vs
Historical             Theme                  Standard

Block 2                Block 4                Block 6
Theme                  Historical             Historical
vs                     vs                     vs
Standard               Standard               Theme


TABLE II
ANALYSIS OF VARIANCE OF SCORES ON THE DESIGN AN EXPERIMENT TEST

Source of                      Sum of        Mean       Null
Variation            D.F.      Squares       Square     Hypothesis

Treatments (adj.)      3      1794.2500     598.0833    accepted
Blocks (unadj.)        5      1721.6667     344.3333    accepted
Intra-Block Error      3       226.7500      75.5833
Total                 11      3742.6667


The F-ratio of 4.56 for the among-blocks variation is less than the 10
percent tabled value of 5.31. The null hypothesis is accepted. The
experimental conclusion is that the factors used to classify the
students, such as the time of day, the work schedules and the pattern
of the students' courses, did not significantly alter the mean scores
of the students.
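
The two F-ratios quoted above can be checked from the sums of squares in Table II. The sketch below assumes 3 degrees of freedom for treatments (four methods), 5 for blocks and 3 for error, as stated or implied in the text; the tabled F values are the ones quoted in the text, and the variable names are illustrative.

```python
# Sums of squares as reported in Table II.
ss = {"treatments (adj.)": 1794.2500,
      "blocks (unadj.)": 1721.6667,
      "intra-block error": 226.7500}

# Degrees of freedom: four treatments, six blocks, three for error.
df = {"treatments (adj.)": 3, "blocks (unadj.)": 5, "intra-block error": 3}

# Tabled F values quoted in the text (5 percent and 10 percent points).
tabled = {"treatments (adj.)": 9.28, "blocks (unadj.)": 5.31}

mean_square = {src: ss[src] / df[src] for src in ss}
error_ms = mean_square["intra-block error"]

f_ratio = {src: mean_square[src] / error_ms for src in tabled}
for src, f in f_ratio.items():
    verdict = "rejected" if f > tabled[src] else "accepted"
    print(f"{src:18s} MS = {mean_square[src]:9.4f}  F = {f:.2f}  "
          f"null hypothesis {verdict}")
```

With only 3 degrees of freedom for error, the tabled values are large, which is why an F-ratio near 8 still falls short of the 5 percent point.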

There are only 3 degrees of freedom for er- 
ror in the incomplete block analysis. Presum- 
ably, if the number of degrees of freedom could 
be increased by utilizing the variation of the ob- 
servational scores within the experimental unit, 
the precision of the analysis would be increased. 

The incomplete block design which was used is arranged in complete
replications. Yates21 has shown that replicated incomplete blocks can
be analyzed as if they were randomized blocks if the block effect is
non-significant. Therefore, the analysis was made as if the experiment
were a three by four randomized block experiment. No conclusion, of
course, can be drawn concerning the differences among replications.

Because the F-ratio for the incomplete block analysis approached
significance, it is no surprise to find that when the information
within the experimental unit is used the F-ratio becomes significant.
The randomized block analysis results in an F-ratio greater than the
tabled value at the 1 percent level. The hypothesis of equal means is
rejected. The rejected null hypothesis is interpreted to mean that
there is a significant difference among the means of methods, as
measured by the Design an Experiment Test. The results of this test
were also analyzed by covariance analysis, which removes the
inequalities due to variation of general ability as measured by the
A.C.E. Psychological Examination. The covariance analysis did not
produce a significant change in the results. As the differences in
general ability are shown to be a negligible factor in the achievement,
the differences measured are attributed to the methods of instruction.

A logical question to ask is: Are any method or methods statistically
different from the other methods? The F-test indicates that there is a
significant difference in means among methods of instruction when the
inequalities due to general ability have been removed. The F-test does
not lead to an unequivocal answer to the question, however. Scheffe has
developed a method of judging whether the observed difference between
two means or two sets of means is significant. Contrasts are formed of
the treatment means. The means of the contrasts are assumed to be equal
and the covariances between the means are assumed to equal zero. A
confidence interval for each contrast is calculated from the estimated
variances, the squares of the standard deviations. If the chance
fluctuations of the means are larger than the observed contrast, the
confidence interval overlaps zero. When the confidence interval
overlaps zero, the difference in the treatment means is regarded as
non-significant. If the confidence interval does not include zero, the
difference between the means is considered significant. When covariance
variables are included in the analysis, the adjusted means must be
used. The mean of the inductive-deductive method was adjusted by
correcting for the regression effect due to general intelligence. The
mean of each of the other three methods was adjusted, also. The
adjustment changes the mean of each contrast to what it would be if the
two groups of students had the same mean score on the A.C.E.
Psychological Examination. The confidence interval for the contrast of
the adjusted means did not overlap zero. Therefore, the
inductive-deductive method is considered better than the other methods
in promoting the student's ability to develop and to decide on a
procedure to follow when he wants the answer to a problem. The other
methods were considered internally consistent because the confidence
interval for the contrast among them overlapped zero. The observed
differences among the other three methods were not significant.
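
The logic of judging a contrast by whether its confidence interval overlaps zero can be sketched as follows. This is a generic Scheffe-type interval under stated assumptions, not a reproduction of the original computation: the means, class sizes, error mean square and tabled F value are all hypothetical, and the tabled F value is passed in rather than computed.

```python
import math


def scheffe_interval(means, n_per_group, coeffs, error_ms, f_crit):
    """Scheffe confidence interval for the contrast sum(c_i * mean_i).

    f_crit is the tabled F value for (t - 1, error d.f.) at the chosen
    level; it is supplied by the caller rather than computed here.
    """
    if abs(sum(coeffs)) > 1e-12:
        raise ValueError("contrast coefficients must sum to zero")
    t = len(means)
    estimate = sum(c * m for c, m in zip(coeffs, means))
    variance = error_ms * sum(c * c / n for c, n in zip(coeffs, n_per_group))
    half_width = math.sqrt((t - 1) * f_crit * variance)
    return estimate - half_width, estimate + half_width


def is_significant(interval):
    """A contrast is significant when its interval excludes zero."""
    low, high = interval
    return not (low <= 0.0 <= high)


# Hypothetical adjusted means for four methods (not the study's data).
means = [30.0, 22.0, 21.0, 20.0]
n_per_group = [20, 20, 20, 20]
error_ms = 40.0   # hypothetical error mean square
f_crit = 2.72     # hypothetical tabled F value

# First method against the average of the other three.
first_vs_rest = scheffe_interval(means, n_per_group, [3, -1, -1, -1],
                                 error_ms, f_crit)
# A comparison among the remaining methods.
within_rest = scheffe_interval(means, n_per_group, [0, 1, 0, -1],
                               error_ms, f_crit)

print(first_vs_rest, is_significant(first_vs_rest))
print(within_rest, is_significant(within_rest))
```

In this made-up example the first contrast excludes zero while the second overlaps it, which mirrors the pattern of conclusions reported above without using the study's own figures.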

The Design an Experiment Test was one part 
of the Written Test. It was mentioned earlier 
that the Written Test and the Performance Test
were constructed to measure the same ability: 
that is, the ability to solve empirical problems. 
It was assumed that the same type of problem- 
solving was involved in the Performance Test as 
in the Written Test. The assumption might not 
be tenable. 

Moonan23 has generalized some experimental designs used in multivariate
analysis. The generalized method characterizes each of the two tests as
a vector. The total vector, or measurement of the objective, is
obtained by adding vectorially the two test scores. Notice that the
characterization of the test scores as vectors utilizes the differences
in the two tests. If the assumption that the tests are equal is not
tenable, then each test measures something which the other test does
not measure. By combining the two tests and adding the unequal
portions, a more comprehensive measurement is obtained. That part of
the two tests which measures the same thing must be subtracted from the
scores of the two tests. This subtraction is achieved by calculating
the covariance effect of the two tests. The covariance effect of the
A.C.E. Psychological Examination is removed, also.

The derivation of the mathematical model assumes that all the
distributions are normal and that the variances and covariances are
homogeneous. The homogeneity of the variances and covariances was
tested by a method developed by Bishop.24 The original test, for which
Bishop's test is an approximation, was derived by Wilks.25 Within the
limits of the test, the variances and covariances of the various cells,
when corrected for inequalities in general intelligence, were found to
be homogeneous.

Table III contains the total sums of squares of the six variables. The
adjusted values of the two tests and their adjusted variance sums of
squares are given in Table IV. The adjusted values are found by solving
the following type of matrix:


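$$\begin{pmatrix} y_1y_1 & y_1y_2 \\ y_2y_1 & y_2y_2 \end{pmatrix} - \frac{1}{xx}\begin{pmatrix} y_1x \\ y_2x \end{pmatrix}\begin{pmatrix} y_1x & y_2x \end{pmatrix}$$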


The adjusted values of the two tests form a 
2 x 2 determinant which is solved to obtain the 
sums of squares column. The tests of signifi- 
cance are performed using the sums of squares, 
rather than the mean squares as in the previous 
analysis of variance tables. 

For two characteristics and any number of treatments, F is distributed
as

$$\frac{1-\sqrt{W}}{\sqrt{W}} \cdot \frac{n - n_q - 1}{n_q}$$

with $2n_q$, $2(n - n_q - 1)$ degrees of freedom. The tests of
significance are made on the variable W, which is equal to

$$W = \frac{A}{A+B}$$

where

n = the total degrees of freedom of treatment B and the error A

n_q = the treatment degrees of freedom

B = the sum of squares of any part of the analysis of variance table
    which is being tested

A = the error sum of squares

The F-ratio of 5.44 which was obtained for the among-treatments source
of variation is significant at the 1 percent level. The observed
differences would occur less than one time out of a hundred in the long
run when the null hypothesis is true. Therefore, the null hypothesis is
rejected. The educational conclusion is that the teaching methods do
differ significantly in their effectiveness in developing in the
student the ability to use the scientific method, as it was defined for
this study.
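
The conversion from W to an F-ratio for two characteristics can be sketched as below. The sketch assumes that A and B are 2 x 2 matrices of sums of squares and products and that W is the ratio of their determinants, in line with the 2 x 2 determinant described earlier; the function names and the numerical values are hypothetical and are not the study's data.

```python
import math


def det2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def w_to_f(A, B, n, n_q):
    """Convert W = |A| / |A + B| to an F-ratio for two characteristics.

    A: 2 x 2 error sums of squares and products (assumed layout).
    B: 2 x 2 sums of squares and products for the source being tested.
    n: total degrees of freedom of B and A; n_q: degrees of freedom of B.
    Returns W, F and the pair of degrees of freedom for F.
    """
    total = [[A[r][c] + B[r][c] for c in range(2)] for r in range(2)]
    W = det2(A) / det2(total)
    root = math.sqrt(W)
    F = (1 - root) / root * (n - n_q - 1) / n_q
    return W, F, (2 * n_q, 2 * (n - n_q - 1))


# Hypothetical matrices chosen so the arithmetic is easy to follow.
A = [[4.0, 0.0], [0.0, 4.0]]
B = [[12.0, 0.0], [0.0, 12.0]]
W, F, dof = w_to_f(A, B, n=9, n_q=3)
print(W, F, dof)
```

A small W means the error determinant is small relative to the total, so the F-ratio grows as W shrinks; the doubled degrees of freedom reflect the two characteristics analyzed together.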


Summary and Conclusions 





As all of our present-day knowledge of the natural sciences is due to
experimentation, it seems natural that the students should have
experiences designed to promote an understanding of the scientific
method. That part of the concept of the scientific method which seems
to be appropriate for the laboratory was defined as the
inductive-deductive method. Considering that aspect of the scientific
method which is suitable for the laboratory, we agree with Kruglak's26
statement that, "What better way is there to teach that the scientific
method is not a superhighway than to place the student in a situation
where he will experience the same failures, make the same mistakes,
suffer the same accidents, and explore the same blind alleys as the
research scientist."

The tests were constructed to measure aspects of problem-solving. The
Interpretation of Data Test was composed of items which measured
various factors related to the interpretation of data. One of these
factors was the use of controls. The inductive-deductive method
differed from the other methods in the manner in which the need for
controls was introduced. In the inductive-deductive method the controls
were introduced because some of the data were difficult to interpret
without the control. Items about errors of measurement and significant
figures were included in the Interpretation of Data Test. The students
in the inductive-deductive method were taught the need of significant
figures and errors of measurement by placing them in a situation where
they had to draw a conclusion from data which were too inaccurate for
the purpose. Items about graphs and data and their interpretation were
included, also. Items of varying complexity were used. The items ranged
in difficulty from items which measured the ability of a student to
recognize a direct proportionality to items which required the student
to find the numerical value of the proportionality constant.

In the inductive-deductive method the students were taught to plot the
data on a graph. By a process of trial and error or from the
interpretation of the resulting curve the students were to manipulate
the variables until they obtained a straight line. From the straight
line graph the students were to write an equation describing the
relationship. This relationship, with the constant of proportionality
evaluated, became the empirical law. Other items measured the deduction
from the discovered law to an applied problem-solving situation.
Controls, errors of measurement, significant figures, and graphs were
always emphasized and explained carefully to the students in the other
methods.

The experimental study showed that, once the data had been collected,
the data were interpreted equally well by the students regardless of
which method was used to teach them. The conclusion is based upon the
acceptance of the null hypothesis. It should be pointed out that the
probability of obtaining the observed differences was in the 5 to 10
percent range for the randomized block analysis when the inequalities
due to general intelligence were removed by covariance analysis. That
is, the observed differences would occur five to ten times out of a
hundred in the long run when the null hypothesis is true. Therefore,
the trend in favor of the inductive-deductive method approaches
significance.

The Design an Experiment Test was constructed to measure the ability of
the students to reorganize the facts they knew, to delimit the problem,
to isolate the important factors of the problem, and to design an
experiment for solving the problem. One important factor from the
preceding discussion could be included in the scoring of the items:
that factor was the use of controls. The inductive-deductive laboratory
classes were given a subject-matter centered problem. The students were
to submit a report on how they would proceed to solve the problem. The
students were aided in their identification of the problem, and in
their review of the facts and factors important to the problem, by a
series of questions. Apparently, while they were formulating their
plans, the students taught by the inductive-deductive method were
required to think through the problem more effectively. At least, this
study indicated that the inductive-deductive method was significantly
superior to the other methods in promoting the ability to develop a
line of attack, as measured by and within the limits of the constructed
test.

The Performance Test was designed to measure the total problem-solving
situation. Many factors can be and have been identified within the
total problem-solving situation. To arrive at the "correct" solution
the student would have to consider the problem, identify the situation,
review what is known, and formulate plans for the solution of the
problem. While conducting the experiment the student must consider the
use of controls for obtaining unequivocal data. The student must do the
experiment carefully using his knowledge of errors of measurement and
significant figures. The scientific attitudes, although not measured
directly, play an important role. It is assumed that if a wide range of
problems is used the student must consider and use the above factors if
he is to discover a relationship which to him is new.

The null hypothesis for the Performance Test was accepted. The
differences in favor of the inductive-deductive method were not
statistically significant. The comparison of the F values with tabled
points of the F distribution indicated, however, that when the
inequalities due to general intelligence were corrected among the
groups, the F-ratio fell just short of the 5 percent point. There is a
definite trend in favor of the inductive-deductive method just as there
was for the Interpretation of Data Test.

In all the separate analyses of the tests the F-ratio approached or
exceeded the 5 percent level of significance. It is not surprising,
therefore, to find that the combined vector analysis of the Written and
Performance Tests shows a significant difference among methods. The
significant difference is based upon a randomized block analysis. Part
of the difference may be due to intra-block error variation which
cannot be estimated by the among-replications sum of squares. This
variation may inflate the treatment effects. Therefore, the complete
validity of the results might be questioned. We assume that this
component of the intra-block variation is small. If this assumption is
reasonable, the inductive-deductive method was successful in developing
problem-solving abilities in subject-matter centered problems.
According to this study, the ability of the student to formulate plans
and to design an experiment appears to be the crucial aspect involved.
The individualized inductive-deductive laboratory appears to offer a
means of developing the student's ability to use the scientific method,
which is defined in terms of the context of discovery and in the
context of this study.


EMPIRICAL COMPARISON OF SIX CESSATION
TESTS FOR USE IN PRINCIPAL COMPONENTS
FACTOR ANALYSIS


JOHN E. STECKLEIN 
University of Minnesota 


THE DETERMINATION of the "best stopping place" in extracting factors
has long been a perplexing problem for factor analysts. In keeping with
the scientific ideal of parsimony in research, the factor analyst
desires to take out as many factors as may be necessary to describe a
given set of variables, but he does not want to extract too many or too
few factors. If he takes out more factors than are necessary, he spends
valuable research time producing figures which will add little to his
understanding of the variables being studied; if he extracts too few
factors, he is likely to miss an important part of the factor
constitution of the variables. Hence a cessation test is very important
in factor analysis.

A test of significance of factors should have certain desirable
characteristics. The test should be one which can be applied while the
factoring process progresses; that is, it should be sequential.
[Although if the complete factorization of a large matrix can be
accomplished in a few short hours by the new computers, as indicated by
Wrigley and Neuhaus (14), sequentiality is less necessary.] The test
should be such that its application is relatively simple; computational
requirements should not be laborious or too involved. It seems logical
that a suitable criterion will be some function of the size of the
sample. It may also be expected to be a function of the number of
variables. As far as possible, the estimator should be consistent and
efficient. It is of some advantage if the criterion is also unbiased,
but this is not crucial if the bias is recognized. The above properties
of a model cessation test should be kept in mind while comparing the
results of the tests described in the following pages.

Many tests have been devised to measure the significance of a given
factor, or of a given number of factors. Investigators have used for
criteria residuals, sample size, number of variables, number of sign
changes, reliabilities, factor variances, etc. Thomson discusses many
tests in the latest edition of his book on factor analysis (12). Burt,
in a recent article (2), analyzes the different tests that have been
suggested and describes problems involved with each. Some cessation
tests provide liberal interpretations, resulting in the extraction of
more factors than are later found to be necessary or interpretable.
Others are too conservative, resulting in a loss of information.
Previous investigators (11, 13) have compared different tests used to
determine when to stop factoring when using the centroid method, but to
the author's knowledge, no previous empirical comparison has been made
of cessation tests specifically applicable to principal components
factor analysis.

The present study has been restricted to a comparison of six cessation
tests: an adaptation of Kelley's test (8) to the Hotelling iterative
method of factoring (7); the respective tests invented by Bartlett (1)
and Hoel (6); the test devised by Guilford and Lacey (5); Burt's
formula (3); and the test proposed by McNemar (10). Several other tests
were considered for inclusion in this study, but they were discarded
either because they were applicable only to the centroid method of
factoring, or because their use appeared to be too laborious or
non-sequential.
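
Sequential application presupposes that factors are taken out one at a time. The following is a minimal sketch of that kind of one-at-a-time extraction, using repeated multiplication to find the largest root and then removing the factor from the residual matrix. It is an illustration under stated assumptions only: the function names and the small correlation matrix are hypothetical, none of the six cessation tests is implemented, and the fixed starting weights could in rare cases be orthogonal to the factor sought.

```python
import math


def first_component(R, iterations=500, tol=1e-12):
    """Largest root and unit-length weights of a symmetric matrix R,
    found by repeatedly multiplying a trial vector by R."""
    p = len(R)
    start = [float(i + 1) for i in range(p)]       # unequal trial weights
    length = math.sqrt(sum(x * x for x in start))
    v = [x / length for x in start]
    root = 0.0
    for _ in range(iterations):
        w = [sum(R[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        new_v = [x / norm for x in w]
        change = max(abs(a - b) for a, b in zip(new_v, v))
        v, root = new_v, norm
        if change < tol:
            break
    return root, v


def extract_factors(R, k):
    """Extract k factors in turn; return (root, loadings) pairs and the
    residual matrix left after the last extraction."""
    p = len(R)
    residual = [row[:] for row in R]
    factors = []
    for _ in range(k):
        root, v = first_component(residual)
        loadings = [math.sqrt(root) * x for x in v]
        factors.append((root, loadings))
        # Remove the variance accounted for by this factor.
        residual = [[residual[i][j] - loadings[i] * loadings[j]
                     for j in range(p)] for i in range(p)]
    return factors, residual


# Hypothetical 3 x 3 correlation matrix (not taken from the study).
R = [[1.0, 0.6, 0.3],
     [0.6, 1.0, 0.4],
     [0.3, 0.4, 1.0]]
factors, residual = extract_factors(R, 2)
for root, loadings in factors:
    print(round(root, 3), [round(a, 3) for a in loadings])
```

A cessation test would be consulted after each pass through the loop in `extract_factors`, using the root, the residual matrix, or both, to decide whether another factor is worth extracting.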


Data Used 


Data used in these comparisons were correl- 
ation matrices obtained from the administration 
of 10 tests of word fluency and vocabulary to 150 
boys and 166 girls in the sixth grade and to 85 
boys and 138 girls in the ninth grade in selected 
rural, town, and city schools in Wisconsin. The 
ten tests were: (9) Vocabulary B—selection 
from four possibilities of the synonym for a 
given word; (20) Vocabulary A—definition of a 
word by use in a sentence; (22) First and Last 
Letters; (23) Suffixes; (24) Adjectives; (25) Things 
Round; (26) Synonyms 1—free recall of a syno- 
nym for a given word; (27) Synonyms 2—free 
recall of a second synonym for the given word; 
(28) Synonyms 3—free recall of a third synonym 
for the word; (29) Letter-Star. These tests have 
been used frequently in studies of word fluency. 

A 10 × 10 correlation matrix was computed for each of the four groups
of students. Each of five cessation tests was applied during the factor
analysis of each of the correlation matrices. The sixth test was
applied to only one set of data. Results were then compared. Four
factors were extracted and interpreted for each of the four groups. The
correlation matrices are shown in Table I. The principal-axis solutions
are listed in Table II.
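The extraction of principal-axis factors from a correlation matrix can be illustrated with a short sketch. It is not the computation used in the study: it finds each component by repeatedly multiplying a trial vector by the matrix and renormalizing, then removes that component from the matrix before extracting the next. The 3 × 3 matrix and the function names are hypothetical and serve only as an illustration.

```python
import math


def largest_component(matrix, tol=1e-12, max_iter=10000):
    """Return (latent root, loadings) of the dominant component of a symmetric matrix."""
    p = len(matrix)
    v = [1.0 / math.sqrt(p)] * p  # arbitrary unit-length trial vector
    for _ in range(max_iter):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        change = max(abs(a - b) for a, b in zip(w, v))
        v = w
        if change < tol:
            break
    # Latent root = v' M v for the converged unit vector v.
    root = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p))
    # Loadings are the unit vector scaled so that their squares sum to the root.
    loadings = [math.sqrt(root) * x for x in v]
    return root, loadings


def extract_factors(corr, k):
    """Extract the first k components, deflating the matrix after each one."""
    p = len(corr)
    residual = [row[:] for row in corr]
    factors = []
    for _ in range(k):
        root, loadings = largest_component(residual)
        factors.append((root, loadings))
        residual = [
            [residual[i][j] - loadings[i] * loadings[j] for j in range(p)]
            for i in range(p)
        ]
    return factors


# Hypothetical correlation matrix with unities in the diagonal.
R = [
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.4],
    [0.3, 0.4, 1.0],
]
factors = extract_factors(R, 2)
for number, (root, loadings) in enumerate(factors, start=1):
    print(number, round(root, 3), [round(a, 3) for a in loadings])
```

The sum of the squared loadings of each factor equals its latent root, which is the variance contribution reported for each factor in the principal-axis solutions.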
TABLE I

INTERCORRELATIONS OF WORD FLUENCY
AND VOCABULARY TESTS, FOUR STUDENT GROUPS*

Sixth Grade Boys (N = 150)

Tests      9     20     22     23     24     25     26     27     28     29
  9    1.000    626    327    367    325    214    496    413    408    308
 20      626  1.000    311    305    365    225    439    505    481    363
 22      327    311  1.000    515    365    367    343    402    338    383
 23      467    305    515  1.000    426    419    401    377    365    429
 24      325    365    465    426  1.000    503    424    346    236    434
 25      214    225    367    419    503  1.000    380    397    190    481
 26      496    439    343    401    424    380  1.000    535    380    323
 27      413    505    402    377    346    397    535  1.000    612    397
 28      408    481    338    365    236    190    380    612  1.000    223
 29      308    363    383    429    434    481    323    397    223  1.000

Sixth Grade Girls (N = 166)

Tests      9     20     22     23     24     25     26     27     28     29
  9    1.000    675    339    394    281    171    427    381    270    204
 20      675  1.000    352    363    179    162    376    321    276    279
 22      439    352  1.000    584    387    369    367    421    494    482
 23      394    363    584  1.000    409    410    408    487    442    437
 24      281    179    387    409  1.000    581    456    491    444    541
 25      171    162    369    410    581  1.000    417    441    421    474
 26      427    376    367    408    456    417  1.000    493    334    380
 27      381    321    421    487    491    441    493  1.000    684    490
 28      270    276    494    442    444    421    334    684  1.000    419
 29      204    279    482    437    541    474    380    490    419  1.000

* Decimal points have been omitted.                          (Continued)
TABLE I (Continued)

Ninth Grade Boys (N = 85)

Tests      9
  9    1.000
 20      624
 22      253
 23      44)
 24      483
 25      285
 26      404
 27      508
 28      336
 29      294

Ninth Grade Girls (N = 138)

Tests      9     28
  9    1.000    343
 20      601    342
 22      064    236
 23      298    302
 24      387    478
 25      055    173
 26      428    316
 27      424    601
 28      343  1.000
 29      291    263
TABLE II

PRINCIPAL-AXIS SOLUTIONS (F MATRICES) FOR WORD FLUENCY
AND VOCABULARY CORRELATIONS, FOUR STUDENT GROUPS*

Sixth Grade Boys                     Sixth Grade Girls

Test  F_I  F_II  F_III  F_IV         Test  F_I  F_II  F_III  F_IV

9 673 -391 +295 +362 9 581 »76 142 066 
2 695 -420 -280 -14 20 554 691 038 -037 
22 644 203 453 -~339 22 708 O10 -410 -364 
23 685 255 308 -299 23 729 054 -262 -297 
24 655 357 -~308 020 24 708 -338 332 «= - 031 
25 613 532 -U98 315 25 658 -406 298 -094 
26 7i1 i119 «192 170 26 680 L116 438 71 
27 752 -2256 Isl 415 27 777 -U99 115 460 
28 633 455 $8u 237 28 714 -181 -378 442 
29 64) 367 ~135 -O26 29 98 -274 O21 -257 


Variance:
4.507 1.257 0.802 0.715 4.675 1.348 0.800 0.714

Ninth Grade Boys                     Ninth Grade Girls

Test  F_I  F_II  F_III  F_IV         Test  F_I  F_II  F_III  F_IV
9 696 065 -211 -45' 9 631 “462 -225 -285 
20 765 «037 -315 -258 20 730 -312 -094 -344 
22 508 +-405 -349 589 22 444 650 -365 101 
23 728 -390 242 031 23 639 474 ~ 346 -109 
24 736 274 - 068 i33 2 723 050 289 -046 
25 442 624 409 202 25 457 420 689 008 
Zé 634 244 -297 ~-078 26 661 -308 205 024 
27 839 -207 197 087 27 765 -234 012 369 
28 690 -273 499 -07i 28 625 -136 -149 631 
29 662 $06 «6-097 265 29 676 247 026 -307 


Variance:
4.615 1.057 0.881 0.773 4.138 1.366 0.936 0.853

* Decimal points have been omitted.
An Adaptation of the Kelley Variance-Ratio Test





The Kelley variance-ratio test of significance 
of the principal-axis components as they exist 
at any stage of the Kelley factoring process was 
developed specifically for use in a study made 
by Frederick B. Davis (4). The Kelley iterative 
method (9) employs a series of rotations between 
selected variables to reduce the variance-covar- 
iance matrix to a diagonal matrix. However, if 
the scores are transformed into standard meas- 
ures, the Kelley method is equally applicable to 
a matrix of intercorrelations, with unities in the 
diagonal —the type of matrix used in the present 
investigation. When all off-diagonal values are 
reduced to zero, or to values not statistically 
significant, the resulting variables are, within 
the limits of chance, the final principal-axis 
components. The variance-ratio test may then 
be applied to the variances of any two consecu- 
tive components to determine the significance of 
the earliest obtained component. The test is: 


$$F_{N-1-g,\;N-1-h} = \frac{Vc_a / (N-1-g)}{Vc_b / (N-1-h)}$$

in which g is the number of rotations necessary to reach the variable
that is Component a, h is the number of rotations necessary to reach
the variable that is Component b, $Vc_a$ is the variance contribution
by Component a, $Vc_b$ is the variance contribution by Component b, and
N is the number in the sample. The table of the distribution of F is
used to determine significance, using N-1-g and N-1-h degrees of
freedom.
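As a minimal sketch of the variance ratio just defined, the function below computes F and its two degrees of freedom. The function name and the numerical inputs are hypothetical and are not taken from the study.

```python
def kelley_variance_ratio(vc_a, vc_b, n, g, h):
    """Variance ratio for two consecutive components.

    vc_a, vc_b: variance contributions of Components a and b
    n:          number of persons in the sample
    g, h:       rotations needed to reach Components a and b
    """
    df_a = n - 1 - g
    df_b = n - 1 - h
    if df_a <= 0 or df_b <= 0:
        raise ValueError("degrees of freedom must be positive")
    f_value = (vc_a / df_a) / (vc_b / df_b)
    return f_value, df_a, df_b


# Hypothetical inputs, for illustration only.
f_value, df_a, df_b = kelley_variance_ratio(vc_a=4.0, vc_b=1.5, n=150, g=20, h=35)
print(round(f_value, 3), df_a, df_b)
```

When g and h are equal the degrees of freedom cancel, and the statistic is simply the ratio of the two variance contributions.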

To the writer's knowledge, no investigation has been made into the
relationship between the number of rotations necessary in Kelley's
method of producing a component and the number of iterations necessary
in Hotelling's method of producing a component. This relationship
provides a worthwhile subject for future investigation. However, since
the sample sizes in this study were 166, 150, 138, and 85, it seemed
likely that, in all except the last case, the number of rotations would
not greatly exceed the number of iterations, if at all. Since the
maximum number of iterations required to extract a factor never
exceeded 60 and was usually much less than 50, it seemed reasonable to
assume that the degrees of freedom would exceed 100 in all except the
last case (ninth grade boys). In studying the F table, it was noted
that, for degrees of freedom exceeding 100 and less than 200, the
variation in critical values of F is very slight. For all practical
purposes, then, the variance-ratio test could be applied with 100
degrees of freedom for each variance computed in the three larger sets
of data, to give a measure of significance of factors. Furthermore,
since the F values decrease with increase in sample size, this would
give a conservative estimate for the critical value of F. Obviously,
such thinking would not hold for the group with sample size of 85. For
this sample, a decrease roughly proportionate to that assumed for the
other samples was arbitrarily taken, and a value of 50 degrees of
freedom was selected as a conservative estimate. With these assumptions
the test reduces to

$$F = \frac{Vc_a}{Vc_b}$$

or simply the ratio of the variance contributions of two consecutive
factors. This adaptation of the Kelley test was applied to the factors
obtained for each of the four groups: sixth grade boys, sixth grade
girls, ninth grade boys, and ninth grade girls. Results are shown in
Table III.

Two factors were found to be significant for 
the sixth grade girls, the sixth grade boys, and 
the ninth grade girls, but only one factor was 
found to be significant for the ninth grade boys. 
The latter finding demonstrates the effect of 
sample size upon this test. 
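The simplified test can be traced with the variance contributions reported for the sixth grade boys (4.507, 1.257, 0.802, 0.715) and the critical values quoted with Table III for 100 degrees of freedom (F = 1.39 at the .05 level, F = 1.59 at the .01 level). The sketch below is illustrative only; the function name and the rule of stopping at the first non-significant ratio are assumptions made for the example.

```python
def classify_ratio(f_value, crit_05=1.39, crit_01=1.59):
    """Label a variance ratio against the two critical values."""
    if f_value >= crit_01:
        return "significant at .01"
    if f_value >= crit_05:
        return "significant at .05"
    return "not significant"


# Variance contributions of Factors I-IV, sixth grade boys.
variances = [4.507, 1.257, 0.802, 0.715]

results = []
for earlier, later in zip(variances, variances[1:]):
    ratio = round(earlier / later, 3)
    results.append((ratio, classify_ratio(ratio)))

# Count factors from the first one until a ratio fails to reach significance.
n_significant = 0
for ratio, label in results:
    if label == "not significant":
        break
    n_significant += 1

for ratio, label in results:
    print(ratio, label)
print("significant factors:", n_significant)
```

Because each ratio uses the variance of the next factor as its denominator, the last extracted factor is never itself tested by this scheme.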


Bartlett's Significance Test 


Bartlett utilizes latent roots in his development of a χ² test of
significance. The determinant of the given correlation matrix R is

$$|R| = \lambda_1 \lambda_2 \lambda_3 \cdots \lambda_p ,$$

where $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_p$ are the
latent roots (in descending order of magnitude) of the correlation
matrix of p variables. The total number of degrees of freedom available
from the original observations is n, where n = N - 1, and N = the total
number of persons tested. The statistical significance can be
determined for the entire correlation structure, but the present
concern is with the statistical significance of the residual roots,
i.e., those roots left after the removal of the largest root, the next
largest, etc. Bartlett takes

$$R_{p-k} = \frac{|R|}{\lambda_1 \lambda_2 \cdots \lambda_k \left[ \dfrac{p - \lambda_1 - \lambda_2 - \cdots - \lambda_k}{p-k} \right]^{p-k}}$$

and multiplies $-\log_e R_{p-k}$ by $n - \tfrac{1}{6}(2p+5) - \tfrac{2}{3}k$,
where p = number of latent roots (number of tests), and k = number of
factors already extracted. This product gives an approximation to χ²
for the successive factors, with the number of degrees of freedom
determined by ½(p-k)(p-k-1).
TABLE III

SIGNIFICANT FACTORS DETERMINED BY THE
KELLEY VARIANCE-RATIO TEST

V_I   = variance accounted for by Factor I
V_II  = variance accounted for by Factor II
V_III = variance accounted for by Factor III
V_IV  = variance accounted for by Factor IV

Sixth Grade Boys (N = 150)               Sixth Grade Girls (N = 166)

V_I / V_II    = 4.507 / 1.257 = 3.586*     V_I / V_II    = 4.675 / 1.348 = 3.466*
V_II / V_III  = 1.257 / .802  = 1.567**    V_II / V_III  = 1.348 / .806  = 1.672**
V_III / V_IV  = .802 / .715   = 1.122***   V_III / V_IV  = .806 / .714   = 1.129***

Ninth Grade Boys (N = 85)                Ninth Grade Girls (N = 138)

V_I / V_II    = 4.615 / 1.058 = 4.362*     V_I / V_II    = 4.138 / 1.366 = 3.029*
V_II / V_III  = 1.058 / .881  = 1.201***   V_II / V_III  = 1.366 / .936  = 1.459**
V_III / V_IV  = .881 / .773   = 1.140***   V_III / V_IV  = .936 / .853   = 1.097***

* Indicates that the lowest factor involved is significant at the one
percent level of significance.

** Indicates that the lowest factor involved is significant at the five
percent level of significance.
(F = 1.39 at .05 level and F = 1.59 at .01 level for 100 degrees of
freedom.)

*** Indicates that the lowest factor involved is not significant.

In the principal components method, the latent roots are the sums of
the squares of the factor weightings for each factor. The first latent
root is also the variance contribution of the first factor toward the
total variance, the second latent root is the variance contribution of
the second factor toward the total variance, etc. Hence Bartlett's test
appears to be particularly suitable for inclusion in a comparison of
significance tests using principal-axis solutions.
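A small sketch may help to show how Bartlett's statistic is built from the latent roots. It assumes that the roots come from a correlation matrix (so that they sum to p) and uses the multiplier n - (2p + 5)/6 - 2k/3, the commonly cited form of Bartlett's correction, which should be checked against Bartlett (1) before any serious use. The function name and the four roots are hypothetical.

```python
import math


def bartlett_residual_chi2(roots, n_persons, k):
    """Approximate chi-square for the roots left after k factors are removed.

    roots:     all p latent roots, in descending order (they sum to p)
    n_persons: sample size N, so that n = N - 1
    k:         number of factors already extracted
    """
    p = len(roots)
    if not 0 <= k < p - 1:
        raise ValueError("k must leave at least two residual roots")
    n = n_persons - 1
    det_r = math.prod(roots)                      # |R| = product of all roots
    extracted = roots[:k]
    mean_rest = (p - sum(extracted)) / (p - k)    # mean of the residual roots
    r_pk = det_r / (math.prod(extracted) * mean_rest ** (p - k))
    multiplier = n - (2 * p + 5) / 6 - 2 * k / 3  # assumed form of the correction
    chi2 = -multiplier * math.log(r_pk)
    df = (p - k) * (p - k - 1) // 2
    return chi2, df


# Hypothetical latent roots of a 4 x 4 correlation matrix (sum = 4).
roots = [2.2, 1.0, 0.5, 0.3]
chi2, df = bartlett_residual_chi2(roots, n_persons=100, k=1)
print(round(chi2, 2), df)
```

The function needs every latent root before it can test any factor, which is the practical limitation discussed in the text: the whole matrix must be factored first.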

However, upon further scrutiny, it is apparent that Bartlett's test is
impractical for general use with a large number of variables because it
is necessary to find all of the characteristic roots for a given matrix
before applying the test. To put it another way, Bartlett's
relationship requires the determination of the value of the matrix
determinant before any of the individual factors can be tested for
significance. This precludes any sequentiality of the testing process,
an important labor-saving characteristic with ordinary computational
procedures.

Despite the lengthy and laborious process, 
one of the four matrices (6th grade boys) was 
completely factored and all 10 latent roots were 
determined, in order to compare the findings of 
Bartlett’s test with the others. Computations 
are shown in Table IV. 

The test found the first three factors to be 
significant at the 1% level, and the fourth factor 
significant at the 5% level. No other factors 
were significant. These results are consider- 
ably more liberal than those given by the adap- 
tation of Kelley’s method. With the high-speed 
factorization promised by newer machines, all 
the latent roots can be easily computed and Bart- 
lett’s test appears to offer an excellent measure 
of significance of factors. 


Hoel developed a significance test which "hinges upon an inequality
found in studying properties of the common factor correlation
determinant | rj" |." As a result, the inequality


Ky + Hm < 


is used to test for the significance of the factors. The first term on
the left is the q-th largest characteristic root of the correlation
matrix with diagonal elements unity (it should be noted that Hoel
starts with the smallest value and counts upward to obtain his q-th
largest value), the second term is the smallest of the communalities,
N is the sample size, n is the number of tests, and n - q is the
postulated minimum essential number of factors. Since the method is
dependent upon the assumption of large samples and approximately normal
distributions in the basic variables, the five percent level from the
normal probability curve gives odds of 19 to 1 against the truth of the
postulated minimum essential number of factors if the inequality is not
satisfied.

Table V shows the results of the application of the Hoel test to the
factorization of the four correlation matrices. It can be seen that for
the sixth grade boys' matrix, two factors are apparently sufficient,
but to be on the safe side it would seem advisable to consider three
factors. For the sixth grade girls' and ninth grade girls' matrices,
two factors seem to be adequate, although to be on the safe side one
might investigate a third factor. The effect of the smallness of sample
size for the ninth grade boys is strongly indicated in this test.
Obviously, the value of the right half of the inequality is decreased
with increase in sample size. Consequently, the five percent level of
acceptance is higher for this small sample than for the other samples.
The result of the application of the test to the matrix of the ninth
grade boys is that one factor is apparently sufficient to represent the
set of data for this group. However, the inequality is close enough to
warrant the precautionary extraction of an additional factor.

Hoel's test duplicates exactly the results found by Kelley's test for
each set of data. However, Hoel recommends the extraction of one more
factor if any uncertainty exists.


The Guilford-Lacey Test 


Guilford and Lacey (5) suggested that the factor loadings be used as
the criteria for determining cessation of factoring. They used the
decrease in factor loadings, rather than the decrease in residual
values, to determine the number of factors. They also took into account
the sample size by setting up the following test criterion: If the
product of the two highest factor loadings for a given factor falls
below 1/√N, where N = sample size, then the number of factors
previously obtained is sufficient. The expression 1/√N can be
recognized as the standard error of a zero correlation for a large
sample, say N greater than 100.
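The Guilford-Lacey criterion lends itself to a very short sketch. The one below takes "highest" to mean largest in absolute value, which is an assumption; the function name and both sets of loadings are hypothetical.

```python
import math


def guilford_lacey_stop(loadings, n_persons):
    """Return (stop, product, threshold) for one newly extracted factor.

    stop is True when the product of the two largest absolute loadings
    falls below 1 / sqrt(N), i.e. the earlier factors are sufficient.
    """
    top_two = sorted((abs(a) for a in loadings), reverse=True)[:2]
    product = top_two[0] * top_two[1]
    threshold = 1.0 / math.sqrt(n_persons)
    return product < threshold, product, threshold


# Hypothetical loadings for a weak and for a stronger factor, N = 150.
weak_factor = [0.10, -0.05, 0.20, 0.08]
strong_factor = [0.42, -0.36, 0.31, 0.02]

print(guilford_lacey_stop(weak_factor, 150))
print(guilford_lacey_stop(strong_factor, 150))
```

Only two loadings enter the product, so a factor with two moderately large loadings passes the criterion however small its remaining loadings are.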

Results of the application of this test to the present data are shown
in Table VI. It appears that the Guilford-Lacey test is much more
liberal than the two previously applied tests because, for all four
groups, application of the test showed that three, four, and five
factors were insufficient. Perhaps this criterion would have yielded
results more comparable to those obtained by the other two tests if
estimated communalities rather than unities had been used in the
diagonals. When the test was applied to the completely factored sixth
grade boys' matrix, it was found that nine factors were considered
significant. This number seems unrealistic for only 10 variables. The
similarity in factor loadings for the other three sets of data
indicated that similar results would be found if factors beyond the
fourth were extracted and the test applied.


TABLE IV

SIGNIFICANT FACTORS DETERMINED BY BARTLETT'S TEST
(SIXTH GRADE BOYS, N = 150)

No. of Factors     d.f.      χ²

      1                    120.34*
      2             28      60.28*
      3             21      46.62*
      4             15      28.04**

Latent Roots

λ1 = 4.507        λ6 = [illegible]
λ2 = 1.257        λ7 = [illegible]
λ3 =  .802        λ8 = .435
λ4 =  .715        λ9 = .330
λ5 =  .634        λ10 = .299

λ1 λ2 λ3 ... λ10 = .023

* Indicates that the factor is significant at the one percent level of
significance.
** Indicates that the factor is significant at the five percent level
of significance.


TABLE V

SIGNIFICANT FACTORS DETERMINED BY THE HOEL TEST

                                        Conclusion

Sixth Grade Boys (N = 150)
    375                                 One factor insufficient
                                        Two factors sufficient

Sixth Grade Girls (N = 166)
    307      1.348                      One factor insufficient
                                        Two factors sufficient
             714

Ninth Grade Boys (N = 85)
    195      1.058                      One factor sufficient
             881

Ninth Grade Girls (N = 138)
    197      1.366                      One factor insufficient
    936      133                        Two factors sufficient
    853      050

*F represents the hypothesized number of factors for minimum rank.


TABLE VI

SIGNIFICANT FACTORS DETERMINED BY
THE GUILFORD-LACEY TEST

                      Product of the
                      two highest
Group                 factor loadings        Conclusion*

Sixth Grade Boys                             Three factors insufficient
                                             Five factors insufficient
                                             Seven factors insufficient
                                             Nine factors sufficient

Sixth Grade Girls                            Three factors insufficient

Ninth Grade Boys           16                Three factors insufficient

Ninth Grade Girls                            Three factors insufficient

* If the product of the two highest factor loadings for a given factor
falls below 1/√N, the factors extracted previous to the given factor
are sufficient.


Burt's Empirical Formula


P. E. Vernon tested some two dozen methods of cessation determination,
as applied to the centroid or simple summation method of factor
analysis, using communalities (13). He concluded that the use of three
methods in combination was best: the Guilford-Lacey test, the Mosier
sum of residuals test (11), and Burt's empirical formula for the
standard error of each factor loading. Vernon suggested that if
agreement were not found using these three tests, the investigator
should resort to McNemar's test and decide according to the results of
the four tests, extracting an additional factor if still in doubt. The
application of Burt's test will be considered next. His formula is

$$SE_d = \frac{(1 - d^2)\sqrt{n}}{\sqrt{N(n - S + 1)}} ,$$

where d = factor loading, N = number of persons, n = number of tests,
and S = ordinal number of the factor.

If one-half of the loadings of a given factor fall below twice their
standard errors as determined by this formula, the factor should be
rejected.
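The decision rule can be sketched as follows. The loadings are those shown for the fourth factor of the sixth grade boys (N = 150, ten tests), with decimal points restored; a few digits are hard to read in the tables, so the values should be treated as illustrative. The function names are hypothetical.

```python
import math


def burt_standard_error(d, n_persons, n_tests, factor_number):
    """Burt's empirical standard error of a single factor loading d."""
    return (1 - d * d) * math.sqrt(n_tests) / math.sqrt(
        n_persons * (n_tests - factor_number + 1)
    )


def burt_accepts(loadings, n_persons, factor_number):
    """Return (accepted, count of loadings below twice their standard error)."""
    n_tests = len(loadings)
    below = sum(
        1
        for d in loadings
        if abs(d) < 2 * burt_standard_error(d, n_persons, n_tests, factor_number)
    )
    # The factor is rejected when one-half or more of the loadings fall below 2 x SE.
    return 2 * below < n_tests, below


# Fourth-factor loadings, sixth grade boys (illustrative reading of the tables).
fourth_factor = [-0.362, -0.149, -0.339, -0.299, 0.020,
                 0.315, 0.170, 0.415, 0.237, -0.028]

accepted, below = burt_accepts(fourth_factor, n_persons=150, factor_number=4)
print(accepted, below)
```

With ten tests the rule amounts to rejecting a factor as soon as five or more loadings fall below twice their standard errors.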

Table VII gives the results of the application of Burt's formula to the
correlation data for the four groups of students. The test accepts a
fourth factor, but rejects a fifth factor for the sixth grade boys. For
sixth grade girls, ninth grade boys, and ninth grade girls, the test
accepts a third factor but rejects a fourth factor. Burt's formula
accepts two more factors than the Kelley and Hoel tests for each of the
boys' groups, and one more factor for each of the girls' groups.


McNemar's Criterion for Number of Factors


McNemar developed a criterion for determining when to stop factoring
from the similarity between a factorial residual and a partial
correlation coefficient. By adjusting the standard deviation of the
residuals so as to approximate closely the standard deviation of the
corresponding partials, McNemar suggests that one can utilize the
knowledge we have concerning the standard error of the partial
correlation coefficient, and thereby establish the following test for
significance of factors: Factoring continues until the adjusted
standard deviation is equal to or less than the standard error of a
zero correlation. Expressed as an equation, this criterion is
$\sigma_\rho \le 1/\sqrt{N}$, where $\sigma_\rho$ is the adjusted
standard deviation, and N is the sample size. McNemar derives the
approximate value of the standard deviation of the partial residuals
to be

$$\sigma_\rho = \frac{\sigma_s}{1 - M_{h^2}} ,$$

where $\sigma_s$ is the standard deviation of the ordinary residual
after s factors have been extracted, and $M_{h^2}$ is the mean
communality for s factors. The test says, in effect, that when
$\sigma_\rho$ reaches or falls below $1/\sqrt{N}$, the magnitudes of
the residuals are such that chance sampling errors in the original
intercorrelations may account for their departure from zero.
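McNemar's stopping rule can be sketched in a few lines, assuming the adjustment takes the form of the residual standard deviation divided by one minus the mean communality. The function names are hypothetical, and the numerical inputs (residual standard deviations of .075 and .069, values of one minus the mean communality of .785 and .818, and N = 138) are illustrative values of the size shown in Table VIII; their pairing with a particular group is an assumption.

```python
import math


def mcnemar_adjusted_sd(sigma_s, mean_communality):
    """Adjusted standard deviation of the residuals after s factors."""
    return sigma_s / (1.0 - mean_communality)


def mcnemar_should_stop(sigma_s, mean_communality, n_persons):
    """True when the adjusted SD is at or below the SE of a zero correlation."""
    return mcnemar_adjusted_sd(sigma_s, mean_communality) <= 1.0 / math.sqrt(n_persons)


# Illustrative values after three and after four factors, N = 138.
stop_after_three = mcnemar_should_stop(0.075, 1 - 0.785, 138)
stop_after_four = mcnemar_should_stop(0.069, 1 - 0.818, 138)
print(stop_after_three, stop_after_four)
```

The threshold depends only on N, so the same residuals that call for further factoring in a large sample may be attributed to chance in a small one.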

Application of this test to the four sets of data indicated four
significant factors for all groups except the ninth grade boys. Here
again sample size affected the results so that only three factors were
found to be significant. Results of the McNemar test (Table VIII) agree
with the Bartlett and Burt tests for the sixth grade boys' data, but
present a more liberal estimate of significant factors than the other
tests for the other three sets of data.


Summary 


Tests of cessation, i.e., tests to determine when one should cease the
extraction of factors, as devised by six different authors, were
compared in their application to a factor analysis of word fluency and
vocabulary test data for each of four groups: sixth grade boys, sixth
grade girls, ninth grade boys, and ninth grade girls. Tests devised by
Kelley (8), Bartlett (1), Hoel (6), Guilford and Lacey (5), Burt (3),
and McNemar (10) were applied and results were compared. Reasonable
agreement was found between the Kelley, Hoel, McNemar, and Burt tests
on all data, and Bartlett's test on the one set of data tested. The
Guilford-Lacey test failed to specify a cessation point for any of the
four sets of data. The easiest test to use was the adaptation of the
Kelley test, although it proved to be the most conservative of the
tests studied. The Bartlett test was by far the most laborious since it
required the complete factorization of the matrix before any factor
could be tested for significance. With the advances being made in
electronic computers, however, this test will be more useful and will
provide a sound basis for testing for cessation of factoring. The
McNemar test provided the most liberal interpretation of significant
factors. However, among the five tests compared (excluding the
Guilford-Lacey test), a range of only two factors existed between the
most liberal and the most conservative estimation of the number of
factors required to represent the given set of data. A summary of the
test results will be found in Table IX.


TABLE VII

SIGNIFICANT FACTORS DETERMINED BY BURT'S EMPIRICAL
FORMULA FOR STANDARD ERROR

SE_d = (1 - d²)√n / √(N(n - S + 1))

d = factor loading
N = number of persons
n = number of tests
S = ordinal number of factor

Sixth Grade Boys (N = 150)

        Hypothesis: F_IV is significant     Hypothesis: F_V is significant
        n = 10, S = 4                       n = 10, S = 5

Test      d      (2 x SE)                     d      (2 x SE)
  9     -362       170                                 230*
 20     -149       190*                                218
 22     -339       172                                 230*
 23     -299       178                                 226*
 24      020       196*                                224*
 25      315       176                                 230*
 26      170       190*                                180
 27      415       162                                 230*
 28      237       184                                 230*
 29     -028       194*                                164

Conclusion: Accept                          Conclusion: Reject

Sixth Grade Girls (N = 166)

        Hypothesis: F_III is significant    Hypothesis: F_IV is significant
        n = 10, S = 3                       n = 10, S = 4

Test      d      (2 x SE)                     d      (2 x SE)
  9      142       170                       066       184*
 20      038       174                      -037       186*
 22     -410       144                      -364       162
 23     -262       162                      -297       170
 24      332       154                      -031       186*
 25      298       158                      -094       184*
 26      438       140                       071       184*
 27     -115       172*                      460       146
 28     -378       148                       442       150
 29      021       174*                     -257       174

Conclusion: Accept                          Conclusion: Reject

* Five or more asterisks (*) results in a rejection of the factor.
(Decimal points have been omitted.)                        (Continued)


TABLE VII (Continued)

Ninth Grade Boys (N = 85)

        Hypothesis: F_III is significant    Hypothesis: F_IV is significant
        n = 10, S = 3                       n = 10, S = 4

Test      d      (2 x SE)                     d      (2 x SE)
  9     -211       232*                                204
 20     -315       218                                 242
 22     -349       213                                 169
 23      242       228                                 259*
 24     -066       241*                                254*
 25      409       202                                 248*
 26     -297       221                                 257*
 27      197       233*                                257*
 28      499       182                                 258*
 29     -097       240*                                241

Conclusion: Accept                          Conclusion: Reject

Ninth Grade Girls (N = 138)

        Hypothesis: F_III is significant    Hypothesis: F_IV is significant
        n = 10, S = 3                       n = 10, S = 4

Test      d      (2 x SE)                     d      (2 x SE)
  9     -225       180                                 186
 20     -094       188*                                180
 22     -365       164                                 202*
 23     -346       168                                 202*
 24      289       174                                 204*
 25      689       100                                 204*
 26      205       182                                 204*
 27      012       190*                                176
 28     -149       186*                                122
 29      026       190                                 184

Conclusion: Accept                          Conclusion: Reject

Note: Calculations show that if a given factor is accepted all previous
factors are also accepted. Hence only the results of the application of
the test to the last accepted factor and to the immediately subsequent
rejected factor are shown in the table.

* Indicates that the factor loading is less than twice its standard
error. Five or more asterisks (*) results in a rejection of the factor.
(Decimal points have been omitted.)


TABLE VIII

SIGNIFICANT FACTORS DETERMINED BY
McNEMAR'S CRITERION*

Group                           s      σ_s     1 - M_h²     σ_ρ

Sixth Grade Boys (N = 150)      1                .549
                                2                .712
                                3                .781
                                4                .818
                                5                .842

Sixth Grade Girls (N = 166)

Ninth Grade Boys (N = 85)

Ninth Grade Girls (N = 138)     2     .082
                                3     .075       .785       .096
                                4     .069       .818       .084

* If the value of σ_ρ is less than or equal to the value of 1/√N,
factoring should cease.

** Indicates that the factor is not significant.


(March, 1956)                                               STECKLEIN


TABLE IX

SUMMARY OF CESSATION TEST RESULTS

                     Number of Significant Factors Indicated by

                     Kelley  Bartlett  Hoel   Burt   McNemar  Guilford-
Group                Test    Test      Test   Test   Test     Lacey Test

Sixth Grade
  Boys (N = 150)       2       4        2      4       4         9
  Girls (N = 166)      2       -        2      3       4         -

Ninth Grade
  Boys (N = 85)        1       -        1      3       3         -
  Girls (N = 138)      2       -        2      3       4         -

* Dash indicates that the number of factors was not determined.

The results of this study do not permit a broad generalization
concerning the use of these cessation tests. Further study of the
adaptation of the Kelley variance-ratio test, especially with larger
samples, seems warranted. All but one test designated four or fewer
significant factors for each of the four sets of data. Three of the six
tests specified three or four significant factors. Psychological
interpretations of word fluency components in this and other factor
analyses of word fluency, however, have resulted in the identification
of only four factors. With the data used in this study, therefore, and
using the principal components method of analysis, Bartlett's, Burt's,
and McNemar's tests seem to designate the number of significant factors
most in agreement with the present limits of psychological
interpretability. Of these three tests, McNemar's seems to be the
easiest to apply.
Abstract


The effects of sampling error were studied empirically with respect to
four methods of equating scales of tests administered to
non-overlapping groups of subjects: (1) mean and sigma method, (2)
equi-percentile method, (3) maximum likelihood method using an "anchor"
test, (4) standard reference group method using an "anchor" test. The
methods were compared under both random and stratified sampling.
Results showed that sampling error was (1) smaller for those methods
which make use of an "anchor" test than for those which do not, (2)
smaller for equated scores closer to the mean of the total population
than for those further from the mean, and (3) not decreased by
stratification by institution when the "anchor" test methods of
equating were used.


I. Introduction 


IN ANY TESTING program in which there is 
more than one form of a test, the problem arises 
of equating scores on the several forms of the 
test so that an examinee’s reported score would 
be independent of the form of the test he happens 
to have taken. 

The problem of equating scores on parallel forms of a test is analogous
to the problem of determining the specific one-to-one relation between
temperatures measured on the Fahrenheit scale and temperatures measured
on the Centigrade scale. When such a one-to-one relation between scores
on the two scales has been determined, scores may be transformed at
will from scale to scale with no loss of information, and it is a
matter of complete indifference in terms of which scale the original
observations were made.

With scales as relatively reliable as temperature, no confusion is
likely to arise with respect to the logic of equating. The relatively
large amount of error variance involved in psychological tests, on the
other hand, may raise some question as to whether this free
transformability from scale to scale is desirable, and whether simple
linear regression, which minimizes in a least squares sense errors of
prediction of scores on Form B from scores on Form A, would not be a
more desirable solution. But such a procedure would equate the tests
for no purpose other than predicting scores on Form B. These predicted
scores would necessarily be reduced in variance as compared with scores
originally observed on Form B, thus requiring a different regression
equation for the prediction of any outside variable from that required
by the observed scores on Form B. Properly equated scores, on the other
hand (provided that the forms of the test are equally reliable), may be
used interchangeably in all formulas for the prediction of outside
criteria.
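The reduction in variance of regression-predicted scores can be made explicit with a short worked equation. The symbols are assumptions introduced for this illustration: A and B are observed scores on the two forms, with means $\bar{A}$ and $\bar{B}$, standard deviations $s_A$ and $s_B$, and correlation r.

```latex
\[
\hat{B} = \bar{B} + r\,\frac{s_B}{s_A}\,(A - \bar{A}),
\qquad
\operatorname{Var}(\hat{B}) = r^{2}\,\frac{s_B^{2}}{s_A^{2}}\,s_A^{2}
  = r^{2} s_B^{2} \le s_B^{2}.
\]
```

Equality holds only when |r| = 1, so with fallible tests the predicted scores are always less variable than the scores actually observed on Form B.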

To define the problem more precisely, two tests are said to be
comparable for a particular population when the distributions of true
scores on the two tests are identical for that population. When the
true scores of each individual on the two tests are identical, the two
tests will be said to be measuring the same function and the tests will
be said to be equated, inasmuch as they would be comparable for any
population. For the case of equally reliable tests, identity of the
true score distributions implies identity of the observed score
distributions. This is the case with which we shall be concerned. This
and related problems have been discussed at length by Angoff (2, 3),
Lord (8, 9), Gulliksen (5), and Flanagan (4).

In many situations it is not advisable (for example, because of
uncontrollable practice effects or the physical length of the tests),
or even possible, to administer both tests to the same group of
subjects. It is still possible to equate tests administered to
different groups of subjects under the assumption that both groups are
samples from the same population. Four methods for accomplishing this
end are: 1. mean and sigma method, 2. equi-percentile method, 3.
maximum likelihood method using an "anchor" test, 4. standard reference
group method using an "anchor" test. This study is an empirical
investigation of the effects of the sampling errors involved in each of
these methods.


II. The Experimental Procedure


The data in this study were taken from an experimental equating
administration of the 1948 and 1949 College Editions of the ACE
Psychological Examination conducted in the summer of 1949. In the
administration, which was conducted with freshman college students,
alternate students were given the 1948 form and the 1949 form of the
test. In addition, all students took a specially constructed short form
of the test, Form XPEX. Approximately 1320 students participated in the
equating administration, 660 taking the 1948 form plus XPEX, and 660
taking the 1949 form plus XPEX. Sixty papers were removed at random
from each of the two groups until the groups were reduced to 600
subjects each.

*This work was carried on with the very helpful guidance and advice of
Drs. William H. Angoff, Frederic M. Lord, and Ledyard R Tucker.

Each of the two sub-groups of 600 was then separated into six samples
of 100, stratified according to school attended. All possible pairings
of samples, one from each of the two sub-groups, were made, yielding 36
such pairs in all. Using any one of these pairs and a given method of
equating scales, it was possible to derive equating parameters for the
two tests (e.g., slope and intercept of a conversion line relating
scores on the two tests). Since there were 36 pairs of samples, there
were also 36 sets of equating parameters for a given method of
equating, differing from each other only insofar as the method in
question was affected by sampling error.

Any specified score on the 1949 scale could then be transformed to
the 1948 scale using each of the 36 sets of parameters, so that for a
given method of equating there were 36 transformed scores on the 1948
scale corresponding to a single score on the 1949 scale. The variance
of these 36 transformed scores would then be a measure of the amount
of sampling error involved and could be used to compare different
methods of equating.
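The pairing design just described can be sketched in a few lines of Python. This is an illustration only: the score lists are synthetic, `fit_mean_sigma` is a hypothetical stand-in for whichever equating method is under study, and none of the names come from the original study.

```python
import random
from itertools import product
from statistics import mean, pstdev, pvariance

def fit_mean_sigma(x_scores, y_scores):
    """Hypothetical stand-in for an equating method: slope and intercept."""
    a = pstdev(y_scores) / pstdev(x_scores)
    return a, mean(y_scores) - a * mean(x_scores)

def transformed_scores(samples_x, samples_y, score, fit=fit_mean_sigma):
    """Transform `score` with the parameters from every pairing of samples."""
    results = []
    for xs, ys in product(samples_x, samples_y):  # 6 x 6 = 36 pairings
        a, b = fit(xs, ys)
        results.append(a * score + b)
    return results

rng = random.Random(0)  # synthetic scores, for illustration only
samples_1949 = [[rng.gauss(100, 30) for _ in range(100)] for _ in range(6)]
samples_1948 = [[rng.gauss(115, 33) for _ in range(100)] for _ in range(6)]

equated = transformed_scores(samples_1949, samples_1948, score=100.0)
sampling_variance = pvariance(equated)  # spread of the 36 transformed scores
```

The variance of the 36 values plays the role of the sampling-error measure described above; any other equating function with the same `(x_scores, y_scores) -> (slope, intercept)` shape could be passed as `fit`.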

For this purpose, three score values were 
specified on the 1949 scale: the score corres- 
ponding to the mean of the total group of 600 who 
took the 1949 test, the score corresponding to 
the mean minus one standard deviation, and the 
score corresponding to the mean minus two 
standard deviations. 

The two groups of 600 (each of which had been separated into 6
samples of 100, stratified by school attended) were then each
re-formed, shuffled thoroughly, and separated again into 6 samples of
100, this time on a random basis. Again, all possible pairs of
samples, one from each group, were used, and, as with the stratified
samples, 36 sets of equating parameters were derived by each method.
The same three score values on the 1949 scale were then transformed
to the 1948 scale using these parameters, so that comparisons could
be made between random and stratified samples with respect to their
sampling error.
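The two ways of forming six samples of 100 can be sketched as follows. The subject records, the five equal-sized schools, and the function names are all assumptions made for the illustration; the original study does not say how many schools there were or how the stratified dealing was carried out.

```python
import random
from collections import Counter

def random_split(subjects, n_samples, rng):
    """Shuffle all subjects together and deal them into n_samples groups."""
    pool = list(subjects)
    rng.shuffle(pool)
    return [pool[i::n_samples] for i in range(n_samples)]

def stratified_split(subjects, stratum_of, n_samples, rng):
    """Shuffle within each stratum, then deal stratum by stratum so that
    every sample gets (nearly) the same number of subjects per stratum."""
    by_stratum = {}
    for subject in subjects:
        by_stratum.setdefault(stratum_of(subject), []).append(subject)
    ordered = []
    for key in sorted(by_stratum):
        members = by_stratum[key]
        rng.shuffle(members)
        ordered.extend(members)
    return [ordered[i::n_samples] for i in range(n_samples)]

# Hypothetical group of 600 subjects from five schools of 120 each.
group = [(i, "school_%d" % (i % 5)) for i in range(600)]
rng = random.Random(1)
stratified = stratified_split(group, lambda s: s[1], 6, rng)
randomized = random_split(group, 6, rng)
schools_in_first_sample = Counter(s[1] for s in stratified[0])
```

In the stratified split every sample holds the same number of subjects from each school; in the random split the school composition is left to chance.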

The means, standard deviations, and correlation coefficients between
ACE (1948 or 1949 form) and XPEX are shown in Tables Ia and Ib for
each of the sub-samples of 100 cases.


III. The Methods of Equating

The notation used here is as follows:

$x$ = subscript denoting 1949 test taken by group $\alpha$
$y$ = subscript denoting 1948 test taken by group $\beta$
$v$ = subscript denoting XPEX test taken by group $t$ ($\alpha + \beta$)
$0$ = subscript denoting a defined standard group
$X$ = a score on the 1949 test
$Y$ = a score on the 1948 test
$Y^*$ = a transformed score on the 1948 scale corresponding to a
        given score $X$ on the 1949 test
$M$ = the mean of a sample
$\mu$ = the mean of the population
$s$ = the standard deviation of a sample
$\sigma$ = the standard deviation of the population
$\hat{\ }$ = mark indicating an estimate of a population parameter


Method 1. Mean and Sigma Method 





This is the simplest method of equating and corresponds to the
maximum likelihood solution for samples drawn from the same normal
population where there are no common measures of both samples. It
assumes that, for all practical purposes, the tests differ only in
their first two moments. Scores on one form are set equivalent to
scores on another form if their respective standard score deviates
are equal. The transformation equations are as follows:

$$Y^* = AX + B, \tag{1}$$

$$A = s_{y_\beta} / s_{x_\alpha}, \tag{2}$$

$$B = M_{y_\beta} - A\,M_{x_\alpha}. \tag{3}$$
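Equations (1) through (3) amount to matching standard scores on the two forms. A minimal Python sketch, with made-up score lists and hypothetical function names, might look like this:

```python
from statistics import mean, pstdev

def mean_sigma_parameters(x_scores, y_scores):
    """Slope A and intercept B of Y* = A*X + B, as in equations (1)-(3).

    Standard deviations are computed with N in the denominator; with
    equal sample sizes the ratio is the same if N-1 is used instead.
    """
    a = pstdev(y_scores) / pstdev(x_scores)
    b = mean(y_scores) - a * mean(x_scores)
    return a, b

def transform(score, a, b):
    """Transformed score on the Y scale for a score on the X scale."""
    return a * score + b

# Illustrative data only (not from the study).
x_sample = [10, 20, 30, 40, 50]
y_sample = [15, 35, 55, 75, 95]
A, B = mean_sigma_parameters(x_sample, y_sample)
y_star = transform(30, A, B)  # the X mean maps onto the Y mean
```

A score at the mean of the first form is carried to the mean of the second form, and a score one standard deviation above the mean is carried to one standard deviation above the mean.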


Method 2. Equi-Percentile Method 





This method defines two scores as equivalent 
if their percentile ranks are equal. This forces 
as nearly identical score distributions as possi- 
ble on the two forms and makes no assumptions 
about the relationship of the higher moments 
(i.e., the shape) of the distribution of scores on 
the two forms of the test. 
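The definition above (equal percentile ranks) can be sketched in Python. The original study smoothed the cumulative curves by hand on probability paper; the sketch below substitutes plain linear interpolation between sample quantiles, which is an assumption of this example, as are the function names and the percentile grid.

```python
from bisect import bisect_left

def quantile(sorted_scores, pct):
    """Score at percentile `pct`, interpolating between order statistics."""
    pos = pct / 100.0 * (len(sorted_scores) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_scores) - 1)
    return sorted_scores[lo] + (pos - lo) * (sorted_scores[hi] - sorted_scores[lo])

def interpolate(x, xs, ys):
    """Piecewise-linear interpolation; xs must be non-decreasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # guarantees xs[i - 1] < x <= xs[i]
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])

def equipercentile(x, x_scores, y_scores, grid=range(2, 99, 3)):
    """Score on form Y with the same percentile rank as `x` on form X."""
    grid = list(grid)
    sx, sy = sorted(x_scores), sorted(y_scores)
    qx = [quantile(sx, p) for p in grid]
    qy = [quantile(sy, p) for p in grid]
    pct = interpolate(x, qx, grid)     # percentile rank of x on form X
    return interpolate(pct, grid, qy)  # score on form Y at that rank

# Illustrative data: form Y is an exact linear function of form X.
form_x = list(range(1, 101))
form_y = [2 * s + 3 for s in form_x]
equated_50 = equipercentile(50, form_x, form_y)
```

Scores outside the range covered by the grid are clamped to the end points, which mirrors the difficulty the paper reports for scores below the lowest observed score in a sample.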

For each sample of 100 cases, cumulative frequency distributions
were plotted on arithmetic probability paper, a smooth curve being
drawn
through the points representing the interval val- 
TABLE IA

MEANS, STANDARD DEVIATIONS, AND COEFFICIENTS OF CORRELATION
OF THE SUB-SAMPLES OF 100 CASES (1948 ACE)

[Entries that are illegible or missing in this copy are shown as "--".]

Stratified Samples

     1948 ACE                XPEX              Correlation
   Mean     S.D.          Mean     S.D.       ACE vs. XPEX
    --     32.59         34.75    10.09            --
    --     31.35         34.48    10.57            --
    --     34.44         34.00    11.22            --
    --     34.73         33.88    10.83            --
    --     34.61         34.15    10.58            --
    --     33.63         34.90      --             --

Random Samples

     1948 ACE                XPEX              Correlation
   Mean     S.D.          Mean     S.D.       ACE vs. XPEX
    --     32.05           --     10.87            --
    --     33.53           --     11.35            --
    --     32.97           --     10.43            --
    --     31.46           --      9.97            --
    --     35.96           --     11.07            --
    --     34.38           --     10.78            --

TABLE IB

MEANS, STANDARD DEVIATIONS, AND COEFFICIENTS OF CORRELATION
OF THE SUB-SAMPLES OF 100 CASES (1949 ACE)

[Entries that are illegible or missing in this copy are shown as "--".]

Stratified Samples

     1949 ACE                XPEX              Correlation
   Mean     S.D.          Mean     S.D.       ACE vs. XPEX
    --     33.08           --     11.12            --
    --     34.64           --     11.13            --
    --     30.37           --     10.87            --
    --     33.21         33.31    10.42            --
    --     31.01         33.22    10.19            --
    --     36.70         33.16    10.53            --

Random Samples

     1949 ACE                XPEX              Correlation
   Mean     S.D.          Mean     S.D.       ACE vs. XPEX
    --     33.19           --     10.84            --
    --     26.57           --      9.01            --
    --     39.78           --     11.52            --
    --     34.07           --     11.10            --
    --     33.33           --     10.75            --
    --     31.24           --     10.83            --

ues as a way of interpolating graphically.

Scores on one form were set equivalent to scores on the other form
if their percentile ranks were equal. This method is referred to in
the present paper as Method 2a, the direct equi-percentile method.

Each curve was then evaluated at every third percentile and for each
pair of samples these 33 1948 ACE scores were plotted against the
corresponding 1949 ACE scores. A straight line was fitted to these 33
points by computing the mean and variance of the observed scores
corresponding to the 33 percentile points for each sample. These
means and variances were then substituted in equations (1), (2), and
(3) to yield a best fitting straight-line transformation. This is
referred to in the present paper as Method 2b, the straight-line
approximation to equi-percentiles. This represents a crude
approximation to Method 1, and is affected by the same assumptions.

Method 3. Maximum Likelihood Method Using An "Anchor" Test

This method makes use of a third test (in this case, the short form
XPEX) administered to both groups to compensate for differences
between the two groups. Like Method 1, it assumes that, for all
practical purposes, the tests differ only in their first two moments.
The method was derived by Lord (9) as a maximum likelihood solution
under the assumption of bivariate normality of the total population.
These same equations had previously been derived by Tucker under a
different set of assumptions, namely:

1) Constancy of the slope and intercept of the regression lines, x on
v and y on v, for $\alpha$ and $\beta$.

2) Constancy of the variance of errors of estimate about the
regression lines from sample to sample.

If the "anchor" test is uncorrelated with the two tests to be
equated, this method reduces to Method 1.

The transformation equations are as follows:

$$Y^* = AX + B, \tag{4}$$

$$A = \hat{\sigma}_y / \hat{\sigma}_x, \tag{5}$$

$$B = \hat{\mu}_y - A\,\hat{\mu}_x, \tag{6}$$

where

$$\hat{\mu}_x = M_{x_\alpha} + b_{xv_\alpha}(\hat{\mu}_v - M_{v_\alpha}), \tag{7}$$

$$\hat{\sigma}_x^2 = s_{x_\alpha}^2 + b_{xv_\alpha}^2(\hat{\sigma}_v^2 - s_{v_\alpha}^2), \tag{8}$$

and

$$\hat{\mu}_y = M_{y_\beta} + b_{yv_\beta}(\hat{\mu}_v - M_{v_\beta}), \tag{9}$$

$$\hat{\sigma}_y^2 = s_{y_\beta}^2 + b_{yv_\beta}^2(\hat{\sigma}_v^2 - s_{v_\beta}^2), \tag{10}$$

$$\hat{\mu}_v = \tfrac{1}{2}(M_{v_\alpha} + M_{v_\beta}), \tag{11}$$

$$\hat{\sigma}_v^2 = \tfrac{1}{2}\left(s_{v_\alpha}^2 + s_{v_\beta}^2 + M_{v_\alpha}^2 + M_{v_\beta}^2\right) - \hat{\mu}_v^2; \tag{12}$$

$b_{xv_\alpha}$, $b_{yv_\beta}$ are the usual regression coefficients,
and $N_\alpha = N_\beta$.

Method 4. Standard Reference Group Method Using An "Anchor" Test

Like Method 3 this method uses a third test (XPEX) administered to
both groups to compensate for differences between the two groups, and
it assumes that the two tests differ, for all practical purposes, in
only the first two moments. Estimations are made for a group defined
as "standard" with a given mean and standard deviation on one of the
tests. It was derived under Tucker's assumptions of constancy of the
slope and intercept of the regression lines and of the variance of
errors of estimate in the two groups. The transformation equations
are as follows:

$$Y^* = AX + B, \tag{13}$$

$$A = \sigma_{y_0} / \hat{\sigma}_{x_0}, \tag{14}$$

$$B = \mu_{y_0} - A\,\hat{\mu}_{x_0}, \tag{15}$$

$$\hat{\mu}_{v_0} = M_{v_\beta} + b_{vy_\beta}(\mu_{y_0} - M_{y_\beta}), \tag{16}$$

$$\hat{\sigma}_{v_0}^2 = s_{v_\beta}^2 + b_{vy_\beta}^2(\sigma_{y_0}^2 - s_{y_\beta}^2), \tag{17}$$

$$\hat{\mu}_{x_0} = M_{x_\alpha} + b_{xv_\alpha}(\hat{\mu}_{v_0} - M_{v_\alpha}), \tag{18}$$

$$\hat{\sigma}_{x_0}^2 = s_{x_\alpha}^2 + b_{xv_\alpha}^2(\hat{\sigma}_{v_0}^2 - s_{v_\alpha}^2); \tag{19}$$

$\mu_{y_0}$, $\sigma_{y_0}$ are given;¹ $b_{vy_\beta}$, $b_{xv_\alpha}$
are the usual regression coefficients; and $N_\alpha = N_\beta$.

¹All footnotes will be found at end of article.


IV. Results

As was stated earlier, for each method there were 36 sets of equating
parameters based on the stratified samples, and another 36 sets of
equating parameters based on the random samples. Thus, for a given
1949 score and given method of equating there were 36 transformed
scores on the 1948 scale based on the stratified samples, and another
36 values based on the ran-
TABLE II

MEANS AND STANDARD DEVIATIONS OF EQUATED SCORES OBTAINED WITH EACH
METHOD, AND FOR RANDOM AND STRATIFIED SAMPLES

[Entries that are illegible or missing in this copy are shown as "--";
truncated entries are marked with "...". Method labels are inferred
from the order of the rows.]

              Stratified Sampling        Random Sampling
Method        Y*          s_Y*           Y*          s_Y*

At the Mean

1             --          2.67           117.85      --
2a            119...      4.77           118.47      --
2b            117...      2.73           117.88      --
3             115...      1.89           115.37      --
4             --          1.55           111.22      --

At the Mean Minus One Standard Deviation

1             --          --             83.75       --
2a            --          --             83.53       --
2b            --          --             83.60       --
3             --          --             81.70       --
4             --          --             82.63       --

At the Mean Minus Two Standard Deviations

1             ...08       7.15           49.65       11...
2a            (...08)²    (10.30)²       (46.50)²    (10...)²
2b            ...26       7.25           49.32       12...
3             ...18       6.41           48.03       5...
4             ...18       5.21           54.03       4...

Note: Each pair of values Y* and s_Y* in this table is based on 36 observations.

TABLE III

RESULTS AT THE MEAN AND AT THE MEAN MINUS TWO STANDARD DEVIATIONS
FOR SIX INDEPENDENT PAIRS OF SAMPLES FOR SIGNIFICANCE TESTS

[Entries that are illegible or missing in this copy are shown as "--";
truncated entries are marked with "...". Method labels are inferred
from the order of the rows.]

              Stratified                 Random
Methods       Y*          s²_Y*          Y*          s²_Y*

At the Mean

1             117...      11.30          117.86      --
2a            119...      29.80          117.83      --
2b            117...      11.87          117.90      --
3             115...      5.65           115.40      --
4             --          3.76           110.76      --

At the Mean Minus Two Standard Deviations⁵

1             --          82.34          49.60       --
2b            --          87.58          49.26       --
3             --          60.15          48.12       --
4             --          37.02          54.17       --

TABLE IV

THEORETICAL AND OBSERVED VARIANCES OF TRANSFORMED SCORES
FOR METHODS 1 AND 3

[The numerical entries of this table are missing in this copy.]

                                          Theoretical    Observed
                                          Variance       Variance

Method 1
  At the Mean                                 --             --
  At the Mean -1 Standard Deviation           --             --
  At the Mean -2 Standard Deviations          --             --

Method 3
  At the Mean                                 --             --
  At the Mean -1 Standard Deviation           --             --
  At the Mean -2 Standard Deviations          --             --

*Significant at the 5% level.

dom samples. The standard deviation of each group of 36 transformed
scores was then a measure of the sampling error involved. Three
values on the 1949 scale were chosen: the mean of the total group of
600 who took the 1949 edition, the mean minus one standard deviation,
and the mean minus two standard deviations. Each of these values was
transformed to the 1948 scale by every one of the transformation
equations, and the mean and standard deviation of each group of 36
equated scores was computed, each group of 36 corresponding to a
given 1949 edition score equated by a given method, and coming from
either random or stratified groups. The results are shown in Table II.

From this table, it is clear that, as expected, the sampling error
increased with the distance from the mean of the distribution. The
apparent exception, Method 2a, shows an artificial reduction in
variance of transformed scores corresponding to the mean minus two
standard deviations. This is due to the fact that this value (the
mean minus two standard deviations of the total 1949 population) was
lower than the lowest observed score in some samples. For purposes of
computation, the lowest observed score in that sample was used.
However, not much importance is to be placed on figures obtained in
this manner. Consequently, they have been omitted from the
statistical tests in Section V.

As would be expected, the stratified samples show less sampling
error than the random samples for Methods 1, 2a, and 2b; there is an
unexpected reversal, however, for Methods 3 and 4, with the
stratified samples apparently showing the larger sampling error. This
finding will be discussed below.

In general, the methods seem to break up into three groups: Method
2a yielding the largest sampling error, Methods 1 and 2b yielding
less error, and Methods 3 and 4 which yield the least sampling
error.³ Method 1, the mean and sigma method, shows the smallest
sampling error of those methods which do not use an "anchor" test.
Method 4 has a slight advantage over Method 3 with respect to
sampling errors, but the mean transformed scores seem to show a
systematic bias towards the parameters of the standard reference
group ($\mu_{y_0}$ = 105.91, $\sigma_{y_0}$ = 24.65); for example,
the mean transformed scores corresponding to the mean of the total
1949 group lie between 115 and 119 for each of the other four
methods, but it is 111 for Method 4, considerably closer to the mean
of the reference group.⁴

V. Statistical Tests—Further Studies 


Although there were 36 sets of transformation parameters for a given
method and either random or stratified sampling, there were not 36
independent sets. Inasmuch as there were only 6 samples in each
population (considering either random or stratified samples by
themselves), there could be no more than 6 pairs of samples without
using some sample more than once; hence there could not be more than
6 sets of parameters for a given method which were statistically
independent of one another for the purpose of significance tests
(although there were 6! ways in which such an independent set of 6
pairs could have been chosen from the 36). One such set of 6
independent pairs was chosen arbitrarily from the stratified samples,
and, similarly, a set of 6 independent pairs of samples was chosen
from the random samples. These same pairs were used in conjunction
with each of the methods of equating.
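The count of 6! follows because an independent set pairs each sample of one group with a different sample of the other group, which is a permutation of six items. A short Python check (variable names are illustrative):

```python
from itertools import permutations
from math import factorial

n_samples = 6

# Each permutation pairs sample i of one group with sample perm[i] of
# the other, so no sample is used more than once within a set.
independent_sets = [
    tuple(enumerate(perm)) for perm in permutations(range(n_samples))
]

all_pairs = {pair for pair_set in independent_sets for pair in pair_set}
n_sets = len(independent_sets)
```

Here `n_sets` is 720, and the sets together draw on all 36 possible pairs, although any one set contains only 6 of them.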

Thus, for the purpose of statistical tests, there were 6 values on
the 1948 scale, instead of 36 values, which corresponded to a given
score on the 1949 scale by a given method of equating and either
random or stratified sampling. Clearly, with samples of this size,
only relatively large differences between samples will be
statistically significant.

In order to assess the effects of sampling er- 
ror at different points on the distribution, the 
methods were tested against one another, not only 
for scores corresponding to the mean of the to- 
tal 1949 group, but also for scores correspond- 
ing to the mean minus two standard deviations. 
The mean and variance of each set of 6 trans- 
formed scores are shown in Table III. 

Obviously, it would not have been appropriate to use the $L_1$ test
or the ordinary F ratio for the significance of the difference
between variances, inasmuch as these tests assume that the true
correlation between the samples being tested is zero. In this case
the samples of 6 transformed scores were correlated, since each of
the methods was applied to the same set of samples. Therefore, the
differences between methods were assessed by means of Wilks'
$L_{vc}$ 2-sample test (12).⁶ However, Wilks' test is applicable to
only two samples at a time; hence all possible pairs of methods were
tested against each other.
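Footnote 6 identifies the test used with the Pitman-Morgan test for the equality of two correlated variances. One common way of computing that test, sketched below, rests on the fact that the sums and differences of paired scores are uncorrelated exactly when the two variances are equal. The function names and the data are illustrative, and this is not claimed to reproduce the computational form used in the study.

```python
from math import sqrt
from statistics import mean

def pearson_r(u, w):
    """Product-moment correlation of two equally long lists."""
    mu, mw = mean(u), mean(w)
    suw = sum((a - mu) * (b - mw) for a, b in zip(u, w))
    suu = sum((a - mu) ** 2 for a in u)
    sww = sum((b - mw) ** 2 for b in w)
    return suw / sqrt(suu * sww)

def pitman_morgan_t(first, second):
    """t statistic (with n - 2 degrees of freedom) for equal variances
    of two correlated samples, via the sum-difference correlation."""
    sums = [a + b for a, b in zip(first, second)]
    diffs = [a - b for a, b in zip(first, second)]
    r = pearson_r(sums, diffs)
    df = len(first) - 2
    return r * sqrt(df) / sqrt(1.0 - r * r), df

# Hypothetical transformed scores from two methods on the same 6 pairs.
method_a = [2, 4, 6, 8, 10, 12]
method_b = [2, 1, 4, 3, 6, 5]
t_stat, dof = pitman_morgan_t(method_a, method_b)
```

A positive `t_stat` indicates that the first list has the larger variance; with 6 pairs of samples there are only 4 degrees of freedom, which is why the paper notes that only large differences can reach significance.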

None of the differences between pairs of meth- 
ods were significant for the stratified samples; 
but for the random samples there were clearly 
significant differences between methods with 
even these few observations in each sample. 

For the random samples, the methods broke up into two groups at the
mean: those methods which use an "anchor" test and those methods
which do not (Methods 3, 4, and Methods 1, 2a, 2b, respectively),
with the latter group showing the greater fluctuation. None of the
differences within each of these two groups was significant, while
the difference between any method in the first group and any method
in the second group was significant at the 1% level.

(March, 1956)

At two standard deviations below the mean for the random samples,
the methods again broke up into two groups: Methods 3 and 4 forming a
group with the least sampling error as before, Methods 1 and 2b
showing more sampling error. (Method 2a was not used for statistical
tests at two standard deviations below the mean [cf. p. 188].) None
of the differences within a group was significant. All differences
between the two groups were significant at the 5% level except for
the difference between Method 3 and Method 2b, which was significant
at the 1% level.

Lord (8: 8, 15) has derived formulas for the 
variance of transformed scores for Methods 1 
and 3, the mean and sigma method and the max- 
imum likelihood method under the conditions of 
random sampling. These theoretical values 
could now be compared with the empirical find- 
ings. 

The theoretical variance of transformed scores under random sampling
by Method 1 is given by the formula:

$$\sigma_{Y^*}^2 = \frac{2\sigma_y^2}{N_t}\,(z_x^2 + 2), \tag{20}$$

$$z_x = (X - \mu_x)/\sigma_x, \tag{21}$$

where $N_t$ in this experiment equals 200; and $\mu_x$, $\sigma_x$,
$\sigma_y$ are population parameters. In evaluating this formula,
large sample estimates $\hat{\mu}_x$, $\hat{\sigma}_x$,
$\hat{\sigma}_y$ were used based on the samples of 600.

For Method 3 the variance of transformed scores under random
sampling is given by:

$$\sigma_{Y^*}^2 = \frac{\sigma_y^2}{N_t}\,(2 - \rho_{xv} - \rho_{yv})(2 + \rho_{xv} + \rho_{yv}) + \frac{\sigma_y^2 z_x^2}{2N_t}\,(2 - \rho_{xv}^2 - \rho_{yv}^2)(2 + \rho_{xv}^2 + \rho_{yv}^2), \tag{22}$$

$$z_x = (X - \mu_x)/\sigma_x, \tag{23}$$

where $N_t$ again equals 200; and $\mu_x$, $\sigma_x$, $\sigma_y$,
$\rho_{xv}$, $\rho_{yv}$ are population parameters. In evaluating
this formula, $\mu_x$, $\sigma_x$, $\sigma_y$, $\rho_{xv}$,
$\rho_{yv}$ were replaced by $\hat{\mu}_x$, $\hat{\sigma}_x$,
$\hat{\sigma}_y$, $\hat{\rho}_{xv}$, $\hat{\rho}_{yv}$ estimated from
the two total samples of 600 each, where

$$\hat{\rho}_{xv} = r_{xv_\alpha}\,(\hat{\sigma}_v\, s_{x_\alpha} / s_{v_\alpha}\,\hat{\sigma}_x), \tag{24}$$

$$\hat{\rho}_{yv} = r_{yv_\beta}\,(\hat{\sigma}_v\, s_{y_\beta} / s_{v_\beta}\,\hat{\sigma}_y), \tag{25}$$

$\rho$ and $r$ being, of course, population and sample coefficients
of correlation, respectively.

Theoretical estimates of the variance of transformed scores were
computed, and are shown in Table IV, along with the empirically
observed variances (these differ from those of Table III in that N-1
was used in computing the variances for Table IV) and significance
levels.
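The two theoretical variances can be evaluated with a few lines of Python. The formulas coded below are equations (20) and (22) as read here from a damaged original, so the exact form of the Method 3 expression should be treated as an assumption; it is checked only against the requirement that it reduce to the Method 1 formula when the "anchor" test is uncorrelated with both tests. The parameter values are hypothetical.

```python
def variance_method1(z_x, sigma_y, n_total):
    """Equation (20): mean and sigma method under random sampling."""
    return 2.0 * sigma_y ** 2 / n_total * (z_x ** 2 + 2.0)

def variance_method3(z_x, sigma_y, n_total, rho_xv, rho_yv):
    """Equation (22) as read here: maximum likelihood anchor method."""
    level_term = (2.0 - rho_xv - rho_yv) * (2.0 + rho_xv + rho_yv)
    slope_term = 0.5 * z_x ** 2 * (
        (2.0 - rho_xv ** 2 - rho_yv ** 2) * (2.0 + rho_xv ** 2 + rho_yv ** 2)
    )
    return sigma_y ** 2 / n_total * (level_term + slope_term)

# Hypothetical values: sigma_y = 33, N_t = 200, anchor correlations 0.85.
for z in (0.0, -1.0, -2.0):
    v1 = variance_method1(z, 33.0, 200)
    v3 = variance_method3(z, 33.0, 200, 0.85, 0.85)
    print("z = %4.1f  Method 1: %6.2f  Method 3: %6.2f" % (z, v1, v3))
```

Both variances grow with the squared distance of the score from the mean, and a highly correlated anchor test shrinks the Method 3 variance well below that of Method 1, which is the pattern the empirical results show.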

For Method 1, the F ratio for the difference 
between the theoretical and observed variances 
approaches significance for the values obtained 
at one standard deviation below the mean and ex- 
ceeds the 5% level of significance for the values 
obtained at two standard deviations below the 
mean, but not for those obtained at the mean. 

Consideration of the discrepancy between the theoretical and
observed variances led to a test of the hypothesis that the samples
of 100 from the 1949 population could reasonably be considered random
samples from the same population. Application of the $L_1$ test for
homogeneity of variance to the six random samples of 100 subjects who
took the 1949 test resulted in the rejection of the null hypothesis
at the 5% level, although these were, in fact, random samples from
the same population. But the $L_1$ test, like Lord's formulas for the
theoretical estimates of variance, is derived on an assumption of
normality of the population involved. (To be more explicit,
bi-variate and tri-variate normal populations are assumed in Lord's
formulas for Methods 1 and 3 respectively.) A recheck into the data
revealed that the samples were decidedly platykurtic; the data had
been taken from a special equating administration in which the
selection of subjects was such that the extremes were heavily
weighted. Distributions of this kind might account for the
discrepancies between theoretical and observed variances of
transformed scores.

Despite the fact that Method 3, the maximum 
likelihood method, was also derived under an as- 
sumption of normality, it does not seem to be 
very sensitive to failures of that assumption: 
with this non-normal population the theoretical 
and observed variances of transformed scores 
are still in good agreement, as shown in Table 
IV. 

The one further question which was investi- 
gated statistically was the relation between ran- 
dom and stratified sampling. The appropriate 
test is the F ratio with 5,5 degrees of freedom. 
The variances of transformed scores for random 
and stratified samples were compared with each 
other for each method of equating, and for scores 
corresponding to the mean and to two standard 
deviations below the mean of the 1949 population. 
There was only one significant value: for Meth- 
od 4, corresponding to the mean of the 1949 pop- 
ulation, the variance of transformed scores is 
significantly smaller for the random samples 
than for the stratified samples.⁸
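Footnote 8 remarks that one spuriously significant result would not be surprising given the number of tests made. A rough calculation supports this, under the simplifying assumptions (made here for illustration, not in the original) that exactly 16 tests are independent, that each is run at the 5% level, and that every null hypothesis is true:

$$E[\text{spurious results}] = 16 \times 0.05 = 0.8, \qquad P(\text{at least one}) = 1 - 0.95^{16} \approx 0.56 .$$

On these assumptions a single significant value among the comparisons is more likely than not to occur by chance alone.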


VI. Discussion 


From this study, it is quite clear that Methods 3 and 4, the methods
using a short "anchor" test, are least affected by sampling error.
Method 4 has a small, but not significant, advantage over Method 3
with respect to sampling error in this set of observations, but it
also involves a systematic bias towards the parameters of the
standard reference population, an effect which would be cumulative as
successive tests are standardized on each other. Method 3 has no
observable bias, and its sampling errors are almost as small as those
of Method 4; it is therefore reasonable to consider Method 3 (the
maximum likelihood method shown in equations 4 through 12) to be the
most satisfactory equating method among those investigated here.
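For readers who want to experiment with Method 3, the following Python sketch implements equations (4) through (12) as they are read here from a damaged original, using moments computed with N in the denominator. The function names and the toy data are assumptions of this example. In the toy data both tests are exact linear functions of the anchor score, and group β is more able than group α, so that a method ignoring the anchor would misplace the intercept.

```python
from math import sqrt
from statistics import mean, pvariance

def slope(dep, ind):
    """Least-squares regression coefficient of `dep` on `ind`."""
    md, mi = mean(dep), mean(ind)
    cov = sum((d - md) * (i - mi) for d, i in zip(dep, ind)) / len(dep)
    return cov / pvariance(ind)

def anchor_ml_equating(x_a, v_a, y_b, v_b):
    """Slope A and intercept B of Y* = A*X + B from equations (4)-(12)."""
    mu_v = 0.5 * (mean(v_a) + mean(v_b))                               # (11)
    var_v = 0.5 * (pvariance(v_a) + pvariance(v_b)
                   + mean(v_a) ** 2 + mean(v_b) ** 2) - mu_v ** 2      # (12)
    b_xv, b_yv = slope(x_a, v_a), slope(y_b, v_b)
    mu_x = mean(x_a) + b_xv * (mu_v - mean(v_a))                       # (7)
    var_x = pvariance(x_a) + b_xv ** 2 * (var_v - pvariance(v_a))      # (8)
    mu_y = mean(y_b) + b_yv * (mu_v - mean(v_b))                       # (9)
    var_y = pvariance(y_b) + b_yv ** 2 * (var_v - pvariance(v_b))      # (10)
    a = sqrt(var_y / var_x)                                            # (5)
    return a, mu_y - a * mu_x                                          # (6)

# Toy data: x = 10v in group alpha, y = 20v - 5 in group beta,
# so the true conversion line is y = 2x - 5.
v_alpha = [1, 2, 3, 4, 5]
x_alpha = [10 * v for v in v_alpha]
v_beta = [2, 3, 4, 5, 6]
y_beta = [20 * v - 5 for v in v_beta]
A3, B3 = anchor_ml_equating(x_alpha, v_alpha, y_beta, v_beta)
```

With these data the sketch recovers the line Y* = 2X - 5 even though the two groups differ in ability, whereas the mean and sigma method applied to the same data would give an intercept of 15.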

Of those methods which do not use an "anchor" test, Method 1, the
mean and sigma method, is least affected by sampling error, followed
closely by Method 2b, the straight-line approximation to
equi-percentiles. Method 2a, the direct equi-percentile method,
which, unlike Method 1, makes no assumption about the shapes of the
distributions on both tests, yields a larger sampling error than
Method 1, as would be expected: there are only two parameters to be
estimated from the data with the mean and sigma method. (It is
obvious that Method 1 cannot lead to truly equated scores unless the
assumption that the shapes of the distributions are identical is
valid. When this assumption is not valid, the equi-percentile method
is the only method of those under consideration which can properly be
applied. But it follows from the work of Keats (6) on the statistical
properties of objective test scores that any two tests of the same
number of items which differ with respect to either the mean or the
standard deviation will differ with respect to skewness as well. It
is a moot point as to how large the errors due to differences in the
shapes of the distributions must be in order to outweigh the increase
in errors due to sampling; clearly, the former becomes more important
and the latter less important as sample size increases.)

For Methods 3 and 4 the random samples yielded less error than the
stratified samples. This requires a few comments. Stratified sampling
will, on the average, yield smaller errors than random sampling.⁹ If,
however, the principles of stratification are irrelevant to the
variables being measured, it is obvious that the degree of
fluctuation of any parameter due to sampling would, on the average,
be neither better nor worse in the stratified samples than in the
random samples; in this case, the principle of stratification (school
attended) is related to the variable measured (performance on the ACE
Psychological Examination), but it is also related to performance on
the "anchor" test. Further, the "anchor" test is itself highly
correlated with the ACE Examination (r = .79 to .91) so that with
Methods 3 and 4, which have been designed to compensate for
differences between samples by means of the "anchor" test, there is
likely to be little or no difference remaining between institutions;
in effect, stratification becomes irrelevant. The difference between
random and stratified samples becomes negligible: the observed
advantage of random samples with Methods 3 and 4 might reasonably be
attributed to sampling fluctuations.


VII. Conclusions 
The major conclusions of this study were: 


1. Methods 3 and 4, the maximum likelihood and standard reference
group methods, both of which make use of an "anchor" test, involved
the least sampling error. Method 4, which involves a systematic
error, had slightly less sampling error than Method 3, but not
significantly so.

2. The equi-percentile method, Method 2, 
had more sampling error than Method 1, the 
mean and sigma method. 

3. The sampling error of transformed scores 
increased with distance from the mean. 

4. Stratification by institution did not reduce 
sampling error with Methods 3 and 4. 

5. Theoretical estimates of variance derived 
under assumptions of normality did not hold for 
Method 1, but did hold for Method 3, even though 
this population was non-normal. 


FOOTNOTES

1. The mean and standard deviation of the 1948 ACE for all colleges
   combined as given in the Norms Bulletin (1) were chosen as the
   values for $\mu_{y_0}$ and $\sigma_{y_0}$ used in this study.
   These values are 105.91 and 24.65, respectively.

2. These figures are not trustworthy (cf. p. 188).

3. It is recalled that Method 2a refers to the direct equi-percentile
   method; Methods 1 and 2b refer to the mean and sigma method and
   the straight-line approximation to equi-percentiles, respectively,
   and Methods 3 and 4 are the maximum likelihood method and the
   standard reference group method, respectively.

4. Actually, the design of the present experiment does not permit an
   adequate appraisal of bias. This would be possible only in an
   experiment in which the true conversion equation were known.
   However, that there is such a systematic bias in the standard
   reference group method can be shown analytically (see Appendix A)
   inasmuch as this method does not lead to consistent statistics.

5. Method 2a was not used for statistical tests at two standard
   deviations below the mean (cf. p. 188).

6. This is identical with the Pitman-Morgan R or t test (10, 11) as
   set forth by Kenny (7) except for computational detail.

7. The large F is not significant inasmuch as the theoretical
   variance is larger than the observed. (The order of the degrees of
   freedom is consequently reversed.)

8. It is more than a little dubious that this represents a real
   difference between the random and stratified samples, as is
   discussed below. Inasmuch as no less than 48 tests of significance
   (not more than 16 of which were independent, however) were made in
   this study, it would not be too surprising if there were one
   spuriously significant result at the 5% level.

9. Unless, of course, the percentage of the total population in each
   stratum has been incorrectly determined.
APPENDIX A

Proof that Method 4, the Standard Reference Group Method, Is Inconsistent

Method 4, the Standard Reference Group Method, is defined as follows:

$$Y^* = AX + B, \tag{13}$$

where $Y^*$ is the transformed score on the Y scale corresponding to
score $X$ on test x by a member of group $\alpha$, and

$$A = \sigma_{y_0} / \hat{\sigma}_{x_0}, \tag{14}$$

$$B = \mu_{y_0} - A\,\hat{\mu}_{x_0}. \tag{15}$$

Here $\mu_{y_0}$, $\sigma_{y_0}$ are the mean and standard deviation
of some arbitrarily defined standard group on test y, and
$\hat{\mu}_{x_0}$, $\hat{\sigma}_{x_0}$ are estimates of the mean and
standard deviation of the standard group on test x:

$$\hat{\mu}_{x_0} = M_{x_\alpha} + b_{xv_\alpha}(\hat{\mu}_{v_0} - M_{v_\alpha}), \tag{18}$$

$$\hat{\sigma}_{x_0}^2 = s_{x_\alpha}^2 + b_{xv_\alpha}^2(\hat{\sigma}_{v_0}^2 - s_{v_\alpha}^2), \tag{19}$$

where $M_{x_\alpha}$, $s_{x_\alpha}$ are the mean and standard
deviation of test x for group $\alpha$, the group that was given both
test x and test v; $b_{xv_\alpha}$ is the regression coefficient of
test x on test v for group $\alpha$; $\hat{\mu}_{v_0}$,
$\hat{\sigma}_{v_0}$ are estimates of the mean and standard deviation
of the standard group on test v, the "anchor" test; $M_{v_\alpha}$,
$s_{v_\alpha}$ are the mean and standard deviation of group $\alpha$
on test v;

$$\hat{\mu}_{v_0} = M_{v_\beta} + b_{vy_\beta}(\mu_{y_0} - M_{y_\beta}), \tag{16}$$

$$\hat{\sigma}_{v_0}^2 = s_{v_\beta}^2 + b_{vy_\beta}^2(\sigma_{y_0}^2 - s_{y_\beta}^2). \tag{17}$$

Here $M_{y_\beta}$, $s_{y_\beta}$ are the mean and standard deviation
of test y for group $\beta$, the group that was given both test y and
test v; $M_{v_\beta}$, $s_{v_\beta}$ are the mean and standard
deviation of group $\beta$ on test v; $b_{vy_\beta}$ is the
regression coefficient of test v on test y for group $\beta$.
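The definition of Method 4 in equations (13) through (19) can be sketched in Python as follows. The function names and the toy data are assumptions of this example, and moments are computed with N in the denominator. In the toy data both tests are exact linear functions of the anchor score, which is the special case of perfect correlation in which the method recovers the true conversion line whatever standard group is chosen.

```python
from math import sqrt
from statistics import mean, pvariance

def slope(dep, ind):
    """Least-squares regression coefficient of `dep` on `ind`."""
    md, mi = mean(dep), mean(ind)
    cov = sum((d - md) * (i - mi) for d, i in zip(dep, ind)) / len(dep)
    return cov / pvariance(ind)

def standard_group_equating(x_a, v_a, y_b, v_b, mu_y0, sigma_y0):
    """Slope A and intercept B of Y* = A*X + B from equations (13)-(19)."""
    b_vy = slope(v_b, y_b)
    mu_v0 = mean(v_b) + b_vy * (mu_y0 - mean(y_b))                      # (16)
    var_v0 = pvariance(v_b) + b_vy ** 2 * (sigma_y0 ** 2 - pvariance(y_b))  # (17)
    b_xv = slope(x_a, v_a)
    mu_x0 = mean(x_a) + b_xv * (mu_v0 - mean(v_a))                      # (18)
    var_x0 = pvariance(x_a) + b_xv ** 2 * (var_v0 - pvariance(v_a))     # (19)
    a = sigma_y0 / sqrt(var_x0)                                         # (14)
    return a, mu_y0 - a * mu_x0                                         # (15)

# Toy data: x = 10v in group alpha, y = 20v - 5 in group beta,
# so the true conversion line is y = 2x - 5.
v_alpha = [1, 2, 3, 4, 5]
x_alpha = [10 * v for v in v_alpha]
v_beta = [2, 3, 4, 5, 6]
y_beta = [20 * v - 5 for v in v_beta]
A4, B4 = standard_group_equating(x_alpha, v_alpha, y_beta, v_beta,
                                 mu_y0=105.0, sigma_y0=40.0)
```

With less than perfect correlations the result depends on the standard group chosen, which is the inconsistency the remainder of this appendix establishes.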

It will be recalled that we are considering the case where tests x and y are equally reliable tests of
the same function, and where $N_\alpha$, the sample size of group $\alpha$, is equal to $N_\beta$, the sample size of group
$\beta$. Further, groups $\alpha$ and $\beta$ are samples from the same population.

By the definition of equated test scores (cf. p. 181) the distribution of the transformed scores $Y^*$ for
any population must be identical with the distribution of $Y$ (observed scores on test y) for that population.
If the total population were available, and if there were no problems of practice effects, the ideal
solution would be, of course, to administer both tests to all subjects and equate by means of Method 2, the
equi-percentile method. But, because of the instability of the higher moments, it is generally considered
more efficient to equate only the first two moments of the distribution where the equating parameters
are estimated from samples. This is the procedure of Methods 1, 3, and 4. (This may also be justified
by assuming that the higher moments of the distribution are already identical, e.g., that they are both
normal curves.)

Let us, therefore, define $M$ and $V$, the mean and the variance of $Y^*$, for group $\alpha$:

$$M = \frac{1}{N_\alpha}\sum Y^*, \tag{26}$$

$$V = \frac{1}{N_\alpha}\sum (Y^* - M)^2. \tag{27}$$

From the above discussion, it is clear that for any consistent method of equating, as $N_\alpha = N_\beta$
approaches infinity, the sample parameters will converge to the population values, and $M$ and $V$ will
converge to $\mu_y$ and $\sigma_y^2$, the true population mean and variance on test y.


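But, for Method 4, lines (16) and (17) converge to

$$\hat{\mu}_{v_0} = \mu_v + \rho_{yv}(\sigma_v/\sigma_y)(\mu_{y_0} - \mu_y), \tag{28}$$

$$\hat{\sigma}_{v_0}^2 = \sigma_v^2 + \rho_{yv}^2(\sigma_v^2/\sigma_y^2)(\sigma_{y_0}^2 - \sigma_y^2), \tag{29}$$

where $\mu_v$ and $\sigma_v$ are the population mean and standard deviation on test v, and $\rho_{yv}$ is the correlation of
tests v and y for the population.
Lines (18) and (19) converge to

$$\hat{\mu}_{x_0} = \mu_x + \rho_{xv}(\sigma_x/\sigma_v)(\hat{\mu}_{v_0} - \mu_v), \tag{30}$$

$$\hat{\sigma}_{x_0}^2 = \sigma_x^2 + \rho_{xv}^2(\sigma_x^2/\sigma_v^2)(\hat{\sigma}_{v_0}^2 - \sigma_v^2), \tag{31}$$

where $\mu_x$ and $\sigma_x$ are the population mean and standard deviation on test x, and $\rho_{xv}$ is the correlation of
tests v and x for the population.
From (28) and (29), lines (30) and (31) become

$$\hat{\mu}_{x_0} = \mu_x + \rho_{xv}\rho_{yv}(\sigma_x/\sigma_y)(\mu_{y_0} - \mu_y), \tag{32}$$

$$\hat{\sigma}_{x_0}^2 = \sigma_x^2 + \rho_{xv}^2\rho_{yv}^2(\sigma_x^2/\sigma_y^2)(\sigma_{y_0}^2 - \sigma_y^2). \tag{33}$$

Lines (14) and (15) now become

$$A = \frac{\sigma_y\,\sigma_{y_0}}{\sigma_x\left[\sigma_y^2 + \rho_{xv}^2\rho_{yv}^2(\sigma_{y_0}^2 - \sigma_y^2)\right]^{1/2}}, \tag{34}$$

$$B = \mu_{y_0} - \frac{\sigma_y\,\sigma_{y_0}}{\sigma_x\left[\sigma_y^2 + \rho_{xv}^2\rho_{yv}^2(\sigma_{y_0}^2 - \sigma_y^2)\right]^{1/2}}\left[\mu_x + \rho_{xv}\rho_{yv}(\sigma_x/\sigma_y)(\mu_{y_0} - \mu_y)\right]. \tag{35}$$

From (27) and (13),

$$V = A^2 s_{x_\alpha}^2. \tag{36}$$

From (34) this now becomes

$$V = \frac{\sigma_y^2\,\sigma_{y_0}^2}{\sigma_y^2 + \rho_{xv}^2\rho_{yv}^2(\sigma_{y_0}^2 - \sigma_y^2)}, \tag{37}$$

which may be written

$$V = \frac{\sigma_y^2\,\sigma_{y_0}^2}{\sigma_y^2\left[1 - \rho_{xv}^2\rho_{yv}^2\right] + \sigma_{y_0}^2\,\rho_{xv}^2\rho_{yv}^2}, \tag{38}$$

which is clearly a function of $\sigma_{y_0}$ and $\rho_{yv}\rho_{xv}$ as well as of $\sigma_y$. It is clear that this limit is equal to $\sigma_y^2$
only in the special cases that either

$$\sigma_{y_0}^2 = \sigma_y^2, \tag{39}$$

or

$$\rho_{xv}^2\rho_{yv}^2 = 1, \tag{40}$$

and hence in all other cases, $V$ is an inconsistent statistic.
The magnitude of the inconsistency may be assessed by defining an error term $(\sigma_y^2 - V)$:

$$\sigma_y^2 - V = \sigma_y^2\left\{1 - \frac{\sigma_{y_0}^2}{\sigma_y^2\left[1 - \rho_{xv}^2\rho_{yv}^2\right] + \sigma_{y_0}^2\,\rho_{xv}^2\rho_{yv}^2}\right\},$$

which may be rewritten

$$\sigma_y^2 - V = (\sigma_y^2 - \sigma_{y_0}^2)\,\frac{\sigma_y^2\left[1 - \rho_{xv}^2\rho_{yv}^2\right]}{\sigma_y^2\left[1 - \rho_{xv}^2\rho_{yv}^2\right] + \sigma_{y_0}^2\,\rho_{xv}^2\rho_{yv}^2},$$

where

$$0 \le \frac{\sigma_y^2\left[1 - \rho_{xv}^2\rho_{yv}^2\right]}{\sigma_y^2\left[1 - \rho_{xv}^2\rho_{yv}^2\right] + \sigma_{y_0}^2\,\rho_{xv}^2\rho_{yv}^2} \le 1.$$

Therefore, the systematic error in $V$ will be some fraction of $(\sigma_y^2 - \sigma_{y_0}^2)$ depending upon $\rho_{xv}\rho_{yv}$,
attaining its maximum (in which case $V = \sigma_{y_0}^2$) when $\rho_{xv}\rho_{yv}$ equals zero.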


Similarly, from (26) and (13) 
M = AMx,, + B ; 


which from (34) and (35) becomes 


M = My, ° [ oy /(oy | 1 - PxvP Fy! + 0%, PivPy)® | 


[ PxyPyyl[ “Yg ci Ay], 


which is clearly a function of My» ; Oy, 1 Vy, and PxvPyy as well as of Aly It is clear that 


M = Aly 
only in the special cases where either 

MY = Aly 9 

PxvP yy - l ’ 


and hence in all other cases M is an inconsistent statistic. 
The magnitude of the inconsistency may be assessed by defining an error term (ay ~ M): 


PxvP yy" yg 





Aly - M = (Aly - y(1- - 1). 
’ yo 6 [ OY(1 - PxyPhy) + [Fp PxvPyyl ® 





and, since x and y are tests of the same function, 
9 > PxyPyy at: 
therefore, 


"Ye PxvPyv 


1 <1, 
[Fg Phe hy + Fl OR]? ~ 





i. °¥g PxvPyv \ 3 
[ OF, PavPyy + Oy(1 - PkvPyy)] 


< 
Thus, the systematic error in M will be some fraction of (Ay - Ay @) depending on the value of PxyPyy>


and attaining its maximum (in which case M = My) when PxvPyy equals zero. 








GROUP-STUDY VERSUS LECTURE-DEMONSTRATION
METHOD IN PHYSICAL SCIENCE INSTRUCTION FOR
GENERAL EDUCATION COLLEGE STUDENTS


JOHN N. WARD 
Michigan State University 


The Problem 


THE PURPOSE of this investigation was to 
determine whether subject matter in physical 
science would be learned as well under a group
method of instruction as under a lecture-demon- 
stration method in a general education college 
course. Specifically the problem resolved into 
comparing the relative effectiveness of the two 
methods of instruction in achieving two objec- 
tives of general education: (1) recall and recog- 
nition of facts, principles, and symbols, and 
(2) more understanding of implications of facts 
and principles, of pertinent reading material, 
and of problem situations. Critical review of 
the literature concerning investigations of var- 
ious group treatments of school classes indicat- 
ed that (1) student samples were often biased in 
selection, mutilated in matching, or so incom- 
pletely described, that generalizations from 
them to their ultimate student populations are 
not feasible; (2) situations in which the experiments
were conducted were often so limited or
incompletely described that generalizations
from them are not feasible; (3) measuring instruments
employed to obtain data were often
of unknown or unreported reliability and validity;
(4) statistical techniques employed to analyze
the data were often obsolete, inappropriate,
or involved inherent assumptions which were 
not tested or verified to justify their application; 
(5) designs of the experiments often neglected 
to incorporate provisions for essential analyses 
such as interactions between teachers and meth- 
ods, and (6) operational descriptions of the group 
methods employed were often so incompletely 
reported or couched in such vague terms that 
their duplication in further research is rendered 
uncertain. 

In spite of their many limitations, these at- 
tempts to expedite learnings of both subject 
matter and attitudes were based upon modern 
tenets of psychology of learning, such as the 
beneficial aspects of self-generated motivation, 
confidence, and meaningfulness. Stimulating 
tendencies are often revealed in these group 
studies, and further research seems clearly im- 
plied, especially for the student population in 
general education science course situations, 
where members of the classroom group common- 
ly lack backgrounds of science or mathematics 
experience or interest, and are often present in 
the group only because it is required of them 
for reasons which they are unprepared to recog- 
nize or accept as meaningful. If a group meth- 
od could stimulate such students to formulate ob- 
jectives which were meaningful to them, and to 
plan and pursue pertinent activities, evaluated 
by themselves in terms of their objectives, then 
in addition to subject matter, concomitant learn- 
ings might well take the direction of scientific 
behavior toward all evidence and assumptions, 
including personal and social relationships. Sci- 
entific attitudes concerning the collection and val- 
idation of evidence in many situations, plus sci- 
entific open-mindedness and criticisms regard- 
ing interpretations and generalizations from evi- 
dence and assumptions, could become powerful 
factors toward realistic adjustments to the total 
environment. Before an experiment of such an 
extensive nature could be efficiently designed, 
however, it was considered essential to first con- 
duct this experiment to determine whether funda- 
mental subject matter objectives could be achieved 
through a group method in physical science. 


The Design of the Experiment 





In attempting to fulfill the requirements 
of a modern self-contained experiment, random- 
ization, replication, and local control were util- 
ized as fully as possible. Randomization permits 
analysis of variance and covariance of R. A. Fisher,
and is basic for testing hypotheses and for
estimation. The claim for randomization in this 
experiment is based upon the fact that the sam- 
ples of students were not influenced by the ex- 
perimenter in any way regarding their selection, 
thus allowing the normal forces acting upon this 
population to result in completely typical sam- 
ples of the student population being registered 
in these classes. This claim is verified by the 
analyses made of the means and variances of
the five samples’ ACE-Q scores, plus those of
the preceding two years’ classes. This means
that the five samples investigated came from
the same population with respect to ACE-Q
scores. Also, random processes were observed
in subdividing the samples to obtain the
data, and in assigning treatments to groups. 
These random processes were also verified by 
analyses of the means and variances of the sub- 
divisions. In all cases, the assumptions of 
equivalent variances within groups were first 
verified before any subsequent analyses of vari- 
ance and covariance were performed to test 
differential treatment mean effects. 

Replication provides a method of validly es- 
timating experimental error; its most import- 
ant purpose is to decrease the error of treat- 
ment comparisons. It is necessary to have at 
least two replications in order to obtain an esti- 
mate of experimental error variance, and this 
was obtained by having two samples to which the 
same treatment was administered. 

Local control assures an experiment its own 
basis for comparisons and the conclusions re- 
garding them, and here the comparison of the 
two methods of instruction randomly assigned to 
the two similar samples provided the basis of 
local control. 

Scientific experimentation is concerned with 
the empirical testing of hypotheses. In order to 
place the burden of showing any significant dif- 
ference between the methods of instruction di- 
rectly upon the evidence obtained from them, the 
following null hypotheses were adopted: (1) There 
is no difference between the subject matter 
achievements of college students who undergo
instruction in physical science for general edu- 
cation by either the lecture-demonstration meth- 
od or a group method, (2) there is no difference 
on recall-recognition type test items, (3) there 
is no difference on more-understanding type 
items. The first hypothesis was tested by uni- 
variate analysis of variance and covariance. 
The second and third were tested by multivari- 
ate analysis of variance. In applying analysis 
of variance to test these null hypotheses, Snede- 
cor’s non-logarithmic equivalent of Fisher's z- 
distribution became the model of the criterion 
with which to test the significance of the evidence
obtained. Due to the natural limitations
on the sizes of the samples of students investi- 
gated, a five percent level of significance was 
adopted; thus to reject the null hypothesis that 
any observed differences in samples’ means 
would be due to chance factors rather than to 
other causes, the observed differences would 
have to be large enough to be attributable to 
chance factors in only five percent or fewer of trials.

Analysis of variance consists of the analytic 
process of breaking down the total sums of 
squares of variation from the grand mean into 
component parts attributable to appropriate 
sources, and then converting them into mean 
squares through division by the proper number 
of degrees of freedom. Such analysis involves 
the basic assumption of common variance of the 
characteristic measured within each of the sam- 
ples compared. This assumption was in every 
case first tested by the F-test for two groups, 
or by the Welch-Nayer L-test for more than two 
groups. Another basic assumption of analysis of 
variance is that of normality of distribution of 
the variates measured within each of the samples 
compared. This was tested in every case with 
graphs of the raw scores versus their probits, 
and the most deviate curve observed was further 
analyzed by Fisher’s k-statistics. Analysis of 
covariance was used to partial out any differences
between the samples due to ACE-Q scores
and to initial abilities in the criterion measure. 
Such analysis involves the assumption of homo- 
geneity of regression coefficients within each of 
the samples compared, and this was first tested 
with the Welch-Nayer L-test or the appropriate 
t-test. 
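
As a rough illustration of the variance-ratio check described above for two groups, the following Python sketch forms the ratio of the larger to the smaller within-sample variance of the Test I criterion scores, using the totals reported in Table I. The function name and the convention of placing the larger variance in the numerator are assumptions made here for illustration. Comparing the ratio with a tabled F value is left to the reader, and this is not presented as the author's own computing procedure.

```python
def sample_variance(n, total, total_sq):
    """Unbiased variance from n, the sum of scores, and the sum of squared scores."""
    return (total_sq - total ** 2 / n) / (n - 1)


# Test I criterion-score totals as printed in Table I.
var_ld = sample_variance(74, 3015, 125145)  # lecture-demonstration sample
var_g = sample_variance(45, 1690, 65038)    # group method sample

# Variance ratio with the larger variance on top (assumed convention).
larger, smaller = max(var_ld, var_g), min(var_ld, var_g)
variance_ratio = larger / smaller

print(f"s2(L-D) = {var_ld:.3f}, s2(G) = {var_g:.3f}, ratio = {variance_ratio:.3f}")
```

A ratio close to one, as here, is the situation in which the assumption of common variance would ordinarily be retained.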

In addition to the analyses of variance and co- 
variance performed on the total scores from the 
criterion instruments, a further attempt was 
made to refine the information thus obtained 
through multivariate analysis of variance, in 
which the two variates, recall-recognition and 
more-understanding, were analyzed simultaneously.
These two variates were direct expressions
of the two objectives purported to be measured
in this experiment, and their simultaneous
analysis recognized the inherent limitations of
one variate to adequately describe the syndrome
of abilities measured. Using a technique described
by Moonan (2), the two variates were assigned
as elements of a matrix in which the natural
properties of their component variables,
such as means and variances, would assume 
their proper values. 

Two criterion measures were employed: Test
I (reliability, .804), given at the middle of the
semester and again six months later as Retest I
(reliability, .77), and Test II (reliability, .57), given at
the end of the semester. Univariate analysis of
variance and covariance was used to test the null 
hypothesis for the total scores achieved on each 
of the measures, Test I, Test II, and Retest I,
by the total samples of the lecture-demonstra- 
tion and group methods. An additional analysis 
of the group method samples was made for Test 
II, with and without tension regarding course
grades, since one fundamental factor among the 
various differences between the two methods was 
the presence of psychological tension regarding 
grades for the lecture-demonstration method 
students, and its absence for the group method 
students. In order to measure its relative effect,
this tension factor was created for one random
half of the group method students who took
Test II at the end of the semester. Thus were 
made available for comparative analysis scores 
from two randomly selected samples of the group 
method, both of which had experienced the whole 
semester of this experimental instruction, but 
one of which was measured while not under tension
with the same instrument as was its equivalent
half while under tension, as the lecture-demonstration
method sample also had been.
Univariate analysis of variance and covariance 
was used to test the null hypothesis for this cri- 
terion measure, Test II, with respect to the total 
sample of the lecture-demonstration method 
(under tension), a random half of the group meth- 
od sample (under tension), and a random half 
of the group method sample (not under tension). 
Three separate analyses were made, each of 
the samples being compared with the other two. 

Each sample whose achievement was tested 
in this investigation was further divided into 
three sub-groups whose data were analyzed in 
addition to that from the total sample. The sub- 
groupings of samples’ data were made on the 
basis of scores achieved by the students on the 
Quantitative Tests of the American Council on 
Education Psychological Examination for College
Freshmen, 1948 Edition, and consisted of
those students in the sample who scored in the 
upper twenty-seven percent, the middle forty- 
six percent, and the lower twenty-seven percent 
of this college’s distribution of these scores, 
based on the scores of more than one thousand
students during seven years of testing. Thus 
were made available for comparison the upper, 
middle, and lower sub-groups of all samples, 
the achievements of which could be analyzed in 
addition to that of the total sample. 
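
The sub-grouping rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the reference distribution is hypothetical (the integers 1 to 100, not data from the study), the function names are invented here, and the treatment of scores that fall exactly on a cut point is an arbitrary choice, since the text does not say how ties were handled.

```python
import statistics


def cut_points(reference_scores):
    """Return the 27th and 73rd percentile cut points of a reference distribution."""
    percentiles = statistics.quantiles(reference_scores, n=100)  # 99 cut points
    return percentiles[26], percentiles[72]


def sub_group(score, low_cut, high_cut):
    """Classify one score as 'lower', 'middle', or 'upper' (ties go downward)."""
    if score <= low_cut:
        return "lower"
    if score > high_cut:
        return "upper"
    return "middle"


# Hypothetical reference distribution standing in for the college's ACE-Q scores.
reference = list(range(1, 101))
low_cut, high_cut = cut_points(reference)

counts = {"lower": 0, "middle": 0, "upper": 0}
for s in reference:
    counts[sub_group(s, low_cut, high_cut)] += 1

print(low_cut, high_cut, counts)
```

Because the cut points come from the reference distribution rather than from each sample, a given sample need not split exactly 27, 46, and 27 percent.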


The Population Investigated 





The population of students, of whom representative
samples constituted the experimental
subjects, consists by definition of all non-science
majors enrolled at Pennsylvania College for
Women. All students in the samples had survived
their freshman year, and all were registered
in this general science course as a required
part of the basic curriculum of this college.
Means and variances of the ACE-Q scores of the
two samples were compared with each other and
with those of this course’s samples from the
previous two years. No significant differences were
found among these means or variances, and it
was therefore assumed that the two samples of
students who supplied the data for this investigation
were representative of the normal population
of students who register for this course in
this college in these times. The mean ACE-Q
score of these students is equivalent to the sixty-third
percentile for women’s national norms.


The Two Methods Employed 





The writer was the only instructor for both of 
the methods. In both methods’ classes, the same 
subject matter topics were scheduled by the in- 
structor in the same sequence through the sem- 
ester, with the difference that in the lecture- 
demonstration method the topics were always 
treated in the class by the instructor only, while 
in the group method the topics were treated in 
the class by the group of students with the in- 
structor, and only when the whole group decided 
to do so and selected its areas of treatment with- 
in the topic as scheduled for consideration by the 
instructor. In both methods’ classes, the same 
audio-visual aids (demonstrations, models,
films, charts, diagrams, pictures, etc.) were
presented, with the difference that in the lecture- 
demonstration method these aids were arbitrar- 
ily inserted into the classes by the instructor ac- 
cording to his opinion as to their appropriate val- 
ues, while in the group method they were present- 
ed to the classes only if the group decided that 
they would be valuable. In both methods’ class- 
es, the same reading assignments were made 
from the same textbook, with the difference that 
under the lecture-demonstration method the readings
were “required”, while under the group
method they were “suggested”.

The lecture-demonstration method was based 
upon certain assumptions, among which the fol- 
lowing were pre-eminent: (1) course objectives 
were the same for all students, and were the re- 
sponsibility of the instructor, (2) course subject 
matter should be selected by the instructor, (3) 
classroom activities should be determined by the 
instructor in order to motivate and stimulate 
learning, and (4) evaluation of each individual 
student’s achievement in the course was the re- 
sponsibility of the instructor, and should be made 
on the basis of scores attained on valid and reli- 
able measuring instruments. Thus in this meth- 
od the instructor alone decided upon the objec- 
tives, selected the required subject matter for 
each class meeting, planned all classroom activ- 
ities, and prepared the measuring instruments 
for evaluation. Student attendance was required 
at all class meetings, which the instructor usu- 
ally began with a review of the required assign- 
ment’s key points, terms, and symbols, follow- 
ing with explanations, demonstrations, etc., al- 
ways trying to expand the content into further 
implications and generalizations. The instructor
continually attempted to express to the students
his own attitudes concerning potential values
to them of the material he had selected.

The group method, too, was based upon cer- 
tain assumptions, among which the following 
were pre-eminent: (1) objectives should be developed
by the whole group during the course,
as both products of, and stimuli to learning, 
(2) subject matter for study during the course 
should be selected by the instructor as expert, 
for consideration by the whole group, (3) class- 
room activities, with relative emphases on sub- 
ject matters, should be decided by the whole 
group in order to motivate and stimulate learn- 
ings, and (4) evaluation of each individual stu- 
dent’s achievement should be made by the student 
himself, in order to render his own developed 
objectives, his emphasized studies, and his re- 
sulting learnings most meaningful to him. Thus 
in this method the responsibilities and opportun- 
ities for the development of objectives, subject 
matter emphases, classroom activities, and 
evaluation of achievement became those of every 
member of the class group. In this group meth- 
od, the instructor continually attempted to ex- 
press to the students his attitudes regarding the 
above assumptions, and to stimulate student 
verbalizations of their reactions to the course 
material and method, with emphasis on preci- 
sion and clarity in all verbalizations, both oral 
and written. He also continually attempted to 
express his recognition of and respect for their 
individual differences in backgrounds, interests, 
and abilities. He insistently proposed that this 
group of differing individuals also recognize and 
respect them, and accordingly try to develop ob- 
jectives and activities which would be meaning- 
ful to the group members in terms of their indi- 
vidual differences. He continually attempted to 
maximize student opportunities and responsibil- 
ities for generating their own criteria for value 
judgments and meaningfulness, and their own 
activities for satisfying those criteria, while he 
minimized student opportunities to satisfy pas- 
sively any criteria arbitrarily imposed by him 
alone. (Detailed descriptions of the methods, 
subject matter treated, and the measuring instruments
are provided in the complete thesis.)


Analysis of the Experimental Results 





Each univariate analysis of variance and co- 
variance made involved calculations of sums of 
squares and cross-products of the test scores, 
testing equality of variances and regression co- 
efficients within groups, organizing analysis of 
variance tables, applying the F-test, adjusting 
sums of squares for the variables to be partialled
out, testing equality of variances of initial
measures within groups, organizing analysis
of variance and covariance tables, and testing
the null hypothesis by the F-test.

The various steps followed in the calculations 
of the five univariate analyses of variance and 
covariance made on the total scores achieved 
by the various samples compared in the experiment
are here explained for the one case of the
lecture-demonstration method (L-D) compared
with the group method (G) on Test I. The
total scores and calculated quantities were obtained
for the criterion test (Y), the pretest (X1),
and the ACE-Q test (X2), and are displayed in
Table I.

For the analysis of variance of the criterion
scores (Y), only the quantities Σ(Y), Σ(Y)², and
n are required for the calculations:

Sum of squares “between methods” (for Hypothesis),
here symbolized as SSyy(H) = sum of
squares of deviations of the method means about
the grand mean = (3015)²/74 + (1690)²/45 - (3015
+ 1690)²/(74 + 45) = 284.347.

Sum of squares “within methods” (for error),
here symbolized as SSyy(e) = sum of squares of
deviations of Y-scores in a method sample about
their method means = (125,145 + 65,038) - (3015)²/74
- (1690)²/45 = 3,873.233.

These were recorded in Table II, with their
appropriate numbers of degrees of freedom,
which are, in this case: for SSyy(H), the number
of methods less one (2 - 1 = 1), and for
SSyy(e), the number of scores of both methods
less the number of methods (74 + 45 - 2 = 117).

The values under the column heading “Mean
Square” (Table II) are obtained by dividing the
sum of squares in each row by the corresponding
number of degrees of freedom. The observed
value of the criterion Fo is obtained by dividing
the “between methods” mean square (for Hypothesis)
by the “within methods” mean square (for
error). Entering the F-tables with n1 = 1, n2 =
117, the value of F.01 is found to be less than
6.90, and since the observed value of 8.589 is
greater than 6.90, it may be concluded that
there is a significant difference at the one percent
level between the mean scores of the two
methods, and that the null hypothesis may be rejected.
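
The arithmetic of this analysis of variance can be retraced with a short Python sketch. It uses only the totals printed in Table I (n, Σ(Y), and Σ(Y)² for each method); the function and variable names are illustrative choices made here, not the author's notation, and the critical value of 6.90 is the bound quoted in the text rather than something the code computes.

```python
# Totals for the criterion test (Y) on Test I, as printed in Table I.
groups = {
    "L-D": {"n": 74, "sum_y": 3015, "sum_y2": 125145},
    "G": {"n": 45, "sum_y": 1690, "sum_y2": 65038},
}


def one_way_anova(groups):
    """Return (ss_between, ss_within, df_between, df_within, f_obs)."""
    n_total = sum(g["n"] for g in groups.values())
    grand_sum = sum(g["sum_y"] for g in groups.values())
    # Sum over methods of (method total)^2 / n.
    group_term = sum(g["sum_y"] ** 2 / g["n"] for g in groups.values())
    correction = grand_sum ** 2 / n_total
    raw_ss = sum(g["sum_y2"] for g in groups.values())

    ss_between = group_term - correction   # "between methods" (Hypothesis)
    ss_within = raw_ss - group_term        # "within methods" (error)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_obs = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_obs


ss_b, ss_w, df_b, df_w, f_obs = one_way_anova(groups)
print(f"SS between = {ss_b:.3f} (df {df_b})")
print(f"SS within  = {ss_w:.3f} (df {df_w})")
print(f"F observed = {f_obs:.3f}; exceeds 6.90: {f_obs > 6.90}")
```

Run as written, the sketch reproduces the sums of squares and the observed F reported in Table II to three decimals.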

The analysis of variance of criterion test 
scores only does not take into account any differ- 
ences among the students of the methods with re- 
spect to pertinent factors which may have been 
operating in addition to the methods themselves 
to affect significantly the results obtained as cri- 
terion scores. It is a purpose of the covariance 
technique to eliminate any such inequalities ex- 
isting between the methods with respect to basic 
characteristics, i.e., general ability and specif- 
ic ability in the material tested by the criterion. 
The process of applying the analysis of covar- 
iance consists in breaking up the sum of pro- 
ducts into parts assignable to different factors, 
comparable to the process of breaking up the 
sum of squares in the case of analysis of vari- 
ance. Here the factors measured by the ACE-Q 
test and the pretest are to be eliminated from 
the adjusted mean squares in order to determine 
whether these factors influence significantly the 
TABLE I

TOTAL SCORES AND CALCULATED QUANTITIES FOR CRITERION TEST (Y),
PRETEST (X1), AND ACE-Q TEST (X2), OBTAINED BY BREAKDOWN OF DATA
FROM THE STUDENTS OF THE LECTURE-DEMONSTRATION METHOD (L-D)
AND THE GROUP METHOD (G) ON TEST I

Quantity           L-D          G

n                   74          45
Σ(X2)            3,219       1,933
Σ(X2)²         149,099      86,529
Σ(X1)            1,220         796
Σ(X1)²          23,532      17,154
Σ(Y)             3,015       1,690
Σ(Y)²          125,145      65,038
Σ(Y·X1)         50,514      30,924
Σ(Y·X2)        131,982      73,479
Σ(X1·X2)        54,388      34,423

(Σ = Sum) (n = number of students in sample)
TABLE II

UNIVARIATE ANALYSIS OF VARIANCE OF TOTAL SCORES ACHIEVED BY TOTAL SAMPLES
OF THE LECTURE-DEMONSTRATION AND GROUP METHODS ON TEST I

Source of          Degrees of   Sum of        Mean
Variation          Freedom      Squares       Square      Fo         Hypothesis

Between Methods      1            284.347     284.347     8.589**    Rejected
Within Methods     117          3,873.233      33.105

(** = one percent level of significance)


TABLE III

SUMS OF SQUARES AND CROSS-PRODUCTS FOR UNIVARIATE ANALYSIS OF VARIANCE
AND COVARIANCE OF TOTAL SCORES ACHIEVED BY TOTAL SAMPLES
OF THE LECTURE-DEMONSTRATION AND GROUP METHODS ON TEST I

            SSx1x1        SSx2x2        XPyx1        XPyx2        XPx1x2
            (a11)         (a22)         (g1)         (g2)         (a12) = (a21)

H:             40.458         8.295     -107.256       48.565       -18.319
e:          6,492.130    12,568.411    1,837.021    1,713.611     1,548.378
(e+H):      6,532.588    12,576.706    1,729.765    1,762.176     1,530.059

H = for Hypothesis, “between methods”
e = for error, “within methods”
(a11), (a22), (g1), (g2), (a12), (a21) = matrix elements for further calculations by methods
of matrix algebra


TABLE IV

UNIVARIATE ANALYSIS OF VARIANCE AND COVARIANCE OF TOTAL SCORES ACHIEVED BY TOTAL
SAMPLES OF THE LECTURE-DEMONSTRATION AND GROUP METHODS ON TEST I

Source of                      Adjusted Sum    Adjusted
Variation               df     of Squares      Mean Square    Fo          Hypothesis

Between Methods (H)       1       328.769        328.769      11.741**    Rejected
Within Methods (e)      115     3,220.069         28.001

(** = one percent level of significance)
results of the different methods as measured by
the criterion test. For this purpose the follow- 
ing calculations are required: 

Sum of squares “between methods” (for Hypothesis)
of X1 scores, here symbolized as
SSx1x1(H) = sum of squares of deviations of the
method means about the grand mean = (1220)²/74
+ (796)²/45 - (1220 + 796)²/(74 + 45) = 40.458.

Sum of squares “within methods” (for error)
of X1 scores, here symbolized as SSx1x1(e) =
sum of squares of deviations of X1 scores in a
method sample about their method means =
(23,532 + 17,154) - (1220)²/74 - (796)²/45 =
6,492.130.

Sum of squares “between methods” (for Hypothesis)
of X2 scores, here symbolized as
SSx2x2(H) = sum of squares of deviations of the
method means about the grand mean = (3219)²/74
+ (1933)²/45 - (3219 + 1933)²/(74 + 45) = 8.295.

Sum of squares “within methods” (for error)
of X2 scores, here symbolized as SSx2x2(e) =
sum of squares of deviations of X2 scores in a
method sample about their method means =
(149,099 + 86,529) - (3219)²/74 - (1933)²/45 =
12,568.411.

Sum of cross-products “between methods”
(for Hypothesis) of Y-X1 scores, here symbolized
as XPyx1(H) = (3015)(1220)/74 + (1690)(796)/45
- (3015 + 1690)(1220 + 796)/(74 + 45) =
-107.256.

Sum of cross-products “within methods”
(for error) of Y-X1 scores, here symbolized
as XPyx1(e) = (50,514 + 30,924) - (3015)(1220)/74
- (1690)(796)/45 = 1,837.021.

Sum of cross-products “between methods”
(for Hypothesis) of Y-X2 scores, here symbolized
as XPyx2(H) = (3015)(3219)/74 + (1690)(1933)/45
- (3015 + 1690)(3219 + 1933)/(74 + 45)
= 48.565.

Sum of cross-products “within methods”
(for error) of Y-X2 scores, here symbolized
as XPyx2(e) = (131,982 + 73,479) - (3015)(3219)/74
- (1690)(1933)/45 = 1,713.611.

Sum of cross-products “between methods”
(for Hypothesis) of X1-X2 scores, here symbolized
as XPx1x2(H) = (1220)(3219)/74 + (796)(1933)/45
- (1220 + 796)(3219 + 1933)/(74 + 45) =
-18.319.

Sum of cross-products “within methods”
(for error) of X1-X2 scores, here symbolized
as XPx1x2(e) = (54,388 + 34,423) - (1220)(3219)/74
- (796)(1933)/45 = 1,548.378.

These calculated sums of squares and cross- 
products are conveniently arranged in Table III. 
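
Every entry in these calculations follows one pattern, which the Python sketch below applies to each pair of variables using the totals printed in Table I. The dictionary layout and the function name `partition` are assumptions made here for illustration; a sum of squares is treated simply as the cross-product of a variable with itself.

```python
# Totals from Table I; keys are the two methods.
n = {"L-D": 74, "G": 45}
sums = {
    "Y": {"L-D": 3015, "G": 1690},
    "X1": {"L-D": 1220, "G": 796},
    "X2": {"L-D": 3219, "G": 1933},
}
# Raw sums of squares and cross-products for each pair of variables.
raw = {
    ("Y", "Y"): {"L-D": 125145, "G": 65038},
    ("X1", "X1"): {"L-D": 23532, "G": 17154},
    ("X2", "X2"): {"L-D": 149099, "G": 86529},
    ("Y", "X1"): {"L-D": 50514, "G": 30924},
    ("Y", "X2"): {"L-D": 131982, "G": 73479},
    ("X1", "X2"): {"L-D": 54388, "G": 34423},
}


def partition(u, v):
    """Return (between, within) sums of products for the variable pair (u, v)."""
    group_term = sum(sums[u][g] * sums[v][g] / n[g] for g in n)
    correction = sum(sums[u].values()) * sum(sums[v].values()) / sum(n.values())
    raw_total = sum(raw[(u, v)].values())
    return group_term - correction, raw_total - group_term


results = {pair: partition(*pair) for pair in raw}
for (u, v), (between, within) in results.items():
    print(f"{u}-{v}: between (H) = {between:10.3f}   within (e) = {within:12.3f}")
```

The "between" and "within" values printed for each pair correspond to the H and e rows of Table III, and their sum to the (e+H) row.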

In order to obtain the adjusted values of the
sum of squares “between methods” (for Hypothesis)
of the Y scores, here symbolized as
*SSyy(H), and of the sum of squares “within
methods” (for error) of the Y scores, here symbolized
as *SSyy(e), Moonan’s matrix algebra
technique was employed as follows, using the
matrix element symbols noted above:

For error:

a11 b1 + a12 b2 = g1
a21 b1 + a22 b2 = g2

solving for b1 and b2,

(e)b2 = (g1 a21 - g2 a11)/(a12 a21 - a11 a22)
      = (2,844,402.902 - 11,124,985.381)
        /(2,397,474.431 - 81,595,758.105)
      = (-8,280,582.479)/(-79,198,283.674)
      = 0.104555

(e)b1 = (g1/a11) - (a12/a11)(e)b2
      = (0.282961) - (0.238501)(0.104555)
      = (0.282961) - (0.024936)
      = 0.258025

(e)B'G = SSE(e) = (e)b1 g1 + (e)b2 g2 = (473.997) +
(179.167) = 653.164

*SSyy(e) = SSyy(e) - SSE(e) = (3,873.233) -
(653.164) = 3,220.069

For error + Hypothesis:

solving for b1 and b2 in the same way with the (e+H) elements,

(e+H)b2 = (2,646,642.506 - 11,511,569.791)
          /(2,341,080.543 - 82,158,438.695)
        = (-8,864,927.285)/(-79,817,358.152)
        = 0.111065

(e+H)b1 = (0.264790) - (0.234219)(0.111065)
        = (0.264790) - (0.026014)
        = 0.238776

(e+H)B'G = SSE(e+H) = (413.026) + (195.716) =
608.742

*SSyy(e+H) = SSyy(e) + SSyy(H) - SSE(e+H)
           = (3,873.233) + (284.347) - (608.742)
           = 3,548.838
TABLE V

MULTIVARIATE ANALYSIS OF VARIANCE, SIMULTANEOUSLY ANALYZING TWO VARIATES,
RECALL-RECOGNITION AND MORE-UNDERSTANDING GAIN SCORES, ACHIEVED BY THE
UPPER SUB-GROUPS OF BOTH METHODS ON TEST I

Treatment         Variate   Sum             Sum of Squares and Cross-Products

1. L-D (upper)    ΔR        Σ(ΔR) = 337     Σ(ΔR)² = 5,509     Σ(ΔR·ΔU) = 3,122
   (n = 23)       ΔU        Σ(ΔU) = 199     Σ(ΔU)² = 2,141
2. G (upper)      ΔR        Σ(ΔR) = 216     Σ(ΔR)² = 3,878     Σ(ΔR·ΔU) = 1,753
   (n = 14)       ΔU        Σ(ΔU) = 104     Σ(ΔU)² =   948

Estimates of the parameter means:

μ1(ΔR): Σ(ΔR)/n = 337/23 = 14.6521739 = (a)
μ1(ΔU): Σ(ΔU)/n = 199/23 =  8.6521739 = (b)
μ2(ΔR): Σ(ΔR)/n = 216/14 = 15.4285714 = (c)
μ2(ΔU): Σ(ΔU)/n = 104/14 =  7.4285714 = (d)

Multivariate hypothesis in matrix form, to be tested:
[μ1(ΔR), μ1(ΔU)] = [μ2(ΔR), μ2(ΔU)]

Sum of Squares and Cross-Products due to Estimates:

| (a)Σ(ΔR)1   (a)Σ(ΔU)1 |     | (c)Σ(ΔR)2   (c)Σ(ΔU)2 |
| (b)Σ(ΔR)1   (b)Σ(ΔU)1 |  +  | (d)Σ(ΔR)2   (d)Σ(ΔU)2 |

    | 4,937.7826   2,915.7826 |     | 3,332.5714   1,604.5714 |
  = | 2,915.7826   1,721.7826 |  +  | 1,604.5714     772.5714 |

    | 8,270.3540   4,520.3540 |
  = | 4,520.3540   2,494.3540 |

Sum of Squares for Error = (SS total) - (SS estimated) =

| Σ(ΔR)²     Σ(ΔR·ΔU) |
| Σ(ΔR·ΔU)   Σ(ΔU)²   |  -  (SS estimated)

    | 9,387   4,875 |     | 8,270.3540   4,520.3540 |
  = | 4,875   3,089 |  -  | 4,520.3540   2,494.3540 |

    | 1,116.6460   354.6460 |
  = |   354.6460   594.6460 |  = (A)

(Continued)














March, 1956) WARD 205 





TABLE V (Continued)

Sum of Squares and Cross-Products due to Null Hypothesis,
[μ1(ΔR), μ1(ΔU)] = [μ2(ΔR), μ2(ΔU)]:

Σ(ΔR)/(n1 + n2) = 553/37 = 14.945946 = (e)
Σ(ΔU)/(n1 + n2) = 303/37 =  8.189189 = (f)

            | (e²)   (ef) |          | 223.38130   122.39518 |
(n1 + n2)   | (ef)   (f²) |  = (37)  | 122.39518    67.06282 |

    | 8,265.1081   4,528.6217 |
  = | 4,528.6217   2,481.3243 |

Sum of Squares and Cross-Products of the Hypothesis Contrast Function = (SS and XP due to Estimates) -
(SS and XP due to Null Hypothesis) =

| 8,270.3540   4,520.3540 |     | 8,265.1081   4,528.6217 |
| 4,520.3540   2,494.3540 |  -  | 4,528.6217   2,481.3243 |

    |  5.2459   -8.2677 |
  = | -8.2677   13.0297 |  = (B)

|(B) - θ[(A) + (B)]| = 0, where (A) and (B) have independent Wishart distributions,

therefore:

| 5.2459 - θ(1,121.8919)    -8.2677 - θ(346.3783) |
| -8.2677 - θ(346.3783)     13.0297 - θ(607.6757) |  = 0

solving for θ: (elements a11·a22 - a12·a21) =
(561,768.5190)θ² - (23,533.2247)θ = 0,
therefore: θ = zero and 0.041891

F[p, n1+n2-2-p+1] = (n1 + n2 - 2 - p + 1)(θ) / [(p)(1 - θ)], where p = number of variates

F[2, 34] = (34)(0.041891) / [(2)(0.958109)] = 0.743

Σ = Sum                        ΔR = Gain Score on Recall-Recognition Items
SS = Sum of Squares            ΔU = Gain Score on More-Understanding Items
XP = Sum of Cross-Products     μ = Parameter Mean
G = Group Method               L-D = Lecture-Demonstration Method
*SSyy(H) = *SSyy(e+H) - *SSyy(e)
         = (3,548.838) - (3,220.069) = 328.769





The adjusted sums of squares “between
methods” (for Hypothesis), here symbolized as
*SSyy(H), and “within methods” (for error),
*SSyy(e), were recorded in the analysis of variance
in Table IV, with their appropriate numbers
of degrees of freedom, which are in this
case: for *SSyy(H), the number of methods less
one (2 - 1 = 1), and for *SSyy(e), the number of
scores of both methods less the number of methods
less the number of b’s calculated in the adjustments
(74 + 45 - 2 - 2 = 115).

Entering the F-tables with n1 = 1, n2 = 115,
the value of F.99 is found to be less than 6.90,
and since the obtained value of 11.741 is greater
than 6.90, it may be concluded that there is
a significant difference between the mean scores
of the two methods at the one percent level when
the effects of the ACE-Q and pretest scores are
partialled out, and that the null hypothesis may
be rejected.
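
The arithmetic behind the obtained F can be retraced with a short sketch. This is only an illustration in Python, not part of the original analysis; the function name `adjusted_f_ratio` is introduced here for convenience, and the inputs are the adjusted sums of squares and the degrees of freedom quoted above. Reading the two b's as the ACE-Q and pretest adjustments is an assumption.

```python
def adjusted_f_ratio(ss_hypothesis, df_hypothesis, ss_error, df_error):
    """Return F = (SS_H / df_H) / (SS_e / df_e) for adjusted sums of squares."""
    ms_hypothesis = ss_hypothesis / df_hypothesis
    ms_error = ss_error / df_error
    return ms_hypothesis / ms_error


# Degrees of freedom as described in the text.
n_methods = 2
n_scores = 74 + 45          # scores of both methods
n_adjustments = 2           # b's calculated in the adjustments (assumed: ACE-Q, pretest)
df_h = n_methods - 1
df_e = n_scores - n_methods - n_adjustments

# Adjusted sums of squares quoted in the text.
ss_e_plus_h = 3548.838
ss_e = 3220.069
ss_h = ss_e_plus_h - ss_e   # "between methods" by subtraction

f_obtained = adjusted_f_ratio(ss_h, df_h, ss_e, df_e)
print(df_h, df_e, round(f_obtained, 2))
```

The obtained ratio is then compared with the tabled value at the chosen level, as the text does with 6.90.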

Multivariate analysis of variance was also 
used for each of the criterion measures with 
respect to all total samples, and in addition was
applied to each of the samples’ three sub-groups: 
the upper twenty-seven percent achievers on the 
ACE-Q test, the middle forty-six percent, and 
the lower twenty-seven percent. In the multi- 
variate analyses, the criterion test items were 
scored under the two classifications:
recall-recognition, and more-understanding. Gain
scores on these two variates were obtained by 
subtracting each student’s pretest score from 
her final score, and these gain scores were 
then analyzed simultaneously, instead of separ- 
ately, by means of the multivariate technique. 
The multivariate analyses consisted of calcula- 
tions of sums of squares and cross-products of 
both variates, testing equality of variances of 
initial scores, final scores, and gain scores on 
both types of items within groups, organizing 
a multivariate analysis of variance table, calcu- 
lating the theta criterion, and applying the F- 
test of the null hypothesis. The analysis
followed the technique described by Moonan.

The calculations made during the multivari- 
ate analyses of the two gain scores also yielded 
statistics with which univariate analyses of var- 
iance could be made on each gain score separ- 
ately, instead of simultaneously, and this too 
was done. 

The various steps followed in the calculations
of multivariate analysis of variance are here
explained for the case of the upper sub-group of
the lecture-demonstration method (L-D), in
comparison with that of the group method (G) on
Test I. The gain scores on recall-recognition
items (AR) and more-understanding items (AU), and
appropriate quantities, were calculated as shown
in Table V.

Entering the F-tables with n1 = 2, n2 = 34,
the value of F.95 is found to be greater than
3.23, and since the obtained value of F is 0.743,
being less than 3.23, it may be concluded that
there is no significant difference at the five per- 
cent level between the means of the samples, and 
that the multivariate null hypothesis of equal 
population mean values for the two variates con- 
sidered simultaneously may be accepted for 
these samples. The analysis is summarized in 
Table VI. 
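
The theta criterion and the F derived from it can be reproduced from the matrices of Table V. The following is a minimal sketch in Python under stated assumptions: the helper names `theta_roots_2x2` and `f_from_theta` are hypothetical, the matrices are typed in from the table as printed, and the F expression is the one implied by the table's arithmetic, (34)(0.041891) / ((2)(0.958109)).

```python
import math


def theta_roots_2x2(b, a_plus_b):
    """Solve |B - theta (A + B)| = 0 for symmetric 2x2 matrices.

    Expanding the determinant gives q2*theta**2 + q1*theta + q0 = 0.
    """
    (b11, b12), (_, b22) = b
    (t11, t12), (_, t22) = a_plus_b
    q2 = t11 * t22 - t12 ** 2
    q1 = -(b11 * t22 + b22 * t11 - 2 * b12 * t12)
    q0 = b11 * b22 - b12 ** 2
    disc = math.sqrt(q1 ** 2 - 4 * q2 * q0)
    return sorted([(-q1 - disc) / (2 * q2), (-q1 + disc) / (2 * q2)])


def f_from_theta(theta, p, n_total):
    """Return (F, df1, df2) with df2 = n1 + n2 - 2 - p + 1."""
    df2 = n_total - 2 - p + 1
    return df2 * theta / (p * (1 - theta)), p, df2


# Values from Table V (upper sub-group, Test I).
B = [[5.2459, -8.2677], [-8.2677, 13.0297]]
A_PLUS_B = [[1121.8919, 346.3783], [346.3783, 607.6757]]

smallest, largest = theta_roots_2x2(B, A_PLUS_B)
f_value, df1, df2 = f_from_theta(largest, p=2, n_total=37)
print(round(largest, 6), df1, df2, round(f_value, 3))
```

Because (B) comes from a contrast between only two groups, it has rank one, so one root is zero apart from rounding in the printed figures and only the larger root enters the test.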

Calculation of the value of F in the analysis 
of variance which would be made on each variate 
separately, using the diagonal elements of ma- 
trices (B) and (A) from Table VI: 


1. For variate AR = Gain Score on Recall-Recognition Items:

$$F_{AR}[\,1,\ n_1+n_2-2\,] = t^2_{AR} = \frac{(n_1+n_2-2)(a_{11}\ \text{of}\ B)}{(a_{11}\ \text{of}\ A)} = \frac{(35)(5.2459)}{(1{,}116.6460)} = 0.164$$

2. For variate AU = Gain Score on More-Understanding Items:

$$F_{AU}[\,1,\ n_1+n_2-2\,] = t^2_{AU} = \frac{(n_1+n_2-2)(a_{22}\ \text{of}\ B)}{(a_{22}\ \text{of}\ A)} = \frac{(35)(13.0297)}{(594.6460)} = 0.767$$


Entering the F-tables with n1 = 1, n2 = 35,
the value of F.95 is found to be greater than 4.10,
and since the obtained values of F for each
separate variate are 0.164 and 0.767, both being
less than 4.10, it may be concluded that there 
is no significant difference at the five percent 
level between the mean gain scores achieved by 
these samples with respect to either variate con- 
sidered separately, and that the null hypothesis 
may be accepted for each separate gain score. 
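
The two univariate ratios can be checked from the diagonal elements alone. The sketch below is an illustration in Python, not the author's worksheet; the name `univariate_f` is hypothetical, and matrix (A) is recovered here by subtracting (B) from the (A) + (B) values that appear in the determinantal equation.

```python
def univariate_f(b_diag, a_diag, n_total):
    """Return F[1, n1+n2-2] = (n1 + n2 - 2) * b_ii / a_ii for one variate."""
    return (n_total - 2) * b_diag / a_diag


# Hypothesis matrix (B) and the sum (A) + (B), as printed in Table V.
B = [[5.2459, -8.2677], [-8.2677, 13.0297]]
A_PLUS_B = [[1121.8919, 346.3783], [346.3783, 607.6757]]

# Error matrix (A) by subtraction.
A = [[A_PLUS_B[i][j] - B[i][j] for j in range(2)] for i in range(2)]

f_ar = univariate_f(B[0][0], A[0][0], n_total=37)  # recall-recognition gain
f_au = univariate_f(B[1][1], A[1][1], n_total=37)  # more-understanding gain
print(round(f_ar, 3), round(f_au, 3))
```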

The results of all tests of significance of
difference between means are summarized in Table
VII.

Differences between standard deviations of
pretest and final test scores were also tested for
significance in the case of every sample
investigated, using the t-test:


$$t_{(n-2)} = \frac{\left[\,s_1^2 - s_2^2\,\right]\sqrt{(n-2)}}{\sqrt{\left[\,4\,(1-r_{12}^2)\,s_1^2\,s_2^2\,\right]}}$$
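
A small sketch may make the test above concrete. It assumes the formula is the usual t for the difference between two correlated variances, with n - 2 degrees of freedom; the function name `correlated_sd_t` and the sample figures are invented for illustration and do not come from the study.

```python
import math


def correlated_sd_t(s1, s2, r12, n):
    """t with n - 2 d.f. for the difference between two correlated variances.

    s1, s2: standard deviations of the two sets of scores (same students)
    r12:    correlation between the two sets of scores
    n:      number of students
    """
    numerator = (s1 ** 2 - s2 ** 2) * math.sqrt(n - 2)
    denominator = math.sqrt(4 * (1 - r12 ** 2) * s1 ** 2 * s2 ** 2)
    return numerator / denominator


# Hypothetical sample: pretest SD 10, final SD 8, correlation .60, 37 students.
t_value = correlated_sd_t(10.0, 8.0, 0.60, 37)
print(round(t_value, 3))
```

The sign of t shows which standard deviation is the larger; equal standard deviations give t = 0 whatever the correlation.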


The results of all tests of significance of
difference between standard deviations are
summarized in Table VIII.


Conclusions from the Experiment 


If the two different procedures of the methods
are regarded as reflecting validly their
respective basic assumptions, and if the results
obtained are regarded as reliable manifestations
of their differences, then the following
summary conclusions are warranted:


(1) Since the group method resulted in longer 
retained more-understanding type of learning, 
and also in greater expression of individual dif- 
ferences in such learning on the part of the up- 
per sub-group of the students, therefore, the 
group method should be employed when it is de- 
sired to produce greater expression of individ- 
ual differences on more-understanding type of 
learning of subject matter among the most cap- 
able students. 

(2) Since the lecture-demonstration method
resulted in greater expression of individual
differences in longer retained more-understanding
type of learning on the part of the lower
sub-group of students, therefore the
lecture-demonstration method should be employed
when it is desired to produce greater expression
of individual differences on more-understanding
type of learning of subject matter among the
least capable students.

(3) Since the lecture-demonstration method 
resulted in greater expression of individual dif- 
ferences in longer retained recall-recognition 
type of learning on the part of the lower three- 
quarters of the students, therefore the lecture- 
demonstration method should be employed when 
it is desired to produce greater expression of 
individual differences on recall-recognition type 
of learning of subject matter among the less 
capable students, both methods being of equal 
value for achieving such objective in the case 
of the most capable students. 
EXPERIMENTAL STUDIES OF THE USE OF
AUDIO-VISUAL AIDS IN VOCATIONAL 
AGRICULTURE 


RALPH R. BENTLEY 
Purdue University 


Introduction


THE USE OF audio-visual aids such as mov- 
ies, slides, and film strips to aid in the teach- 
ing of vocational agriculture has become a com- 
mon practice with many teachers. The wide 
and general use of audio-visual aids by teachers 
of vocational agriculture has been due to a num- 
ber of factors among which are the following: 
(1) the wide and favorable publicity which audio- 
visual aids have received from many sources, 
including the armed forces which made exten- 
sive use of audio-visual aids during World War 
II, (2) the availability of movies, slides, and
film strips, (3) the availability of projection 
rooms and equipment in many schools, and (4) 
the broad and favorable generalizations regard- 
ing the effectiveness of audio-visual aids in 
learning situations which are based in part upon 
experimental research in areas other than
vocational agriculture.

The evaluation of audio-visual aids in voca- 
tional agriculture has been largely subjective 
and has been done by individual teachers, by 
those who prepare audio-visual aids advertising 
and catalog copy, and by groups of agriculture
teachers working through their state associations.
Only one experimental study evaluating
the effectiveness of audio-visual aids¹* in
vocational agriculture has come to the attention of
the writer. This study evaluated the effective- 
ness of audio-visual aids on informational and 
applicational learning in the areas of home 
gardens, swine production, and pasture produc- 
tion. In this study of the use of audio-visual 
aids in three projects, the only significant dif- 
ference in favor of the experimental or audio- 
visual aids group was found in the informational 
phase of the home garden project. No significant
differences were found between the exper-
imental and control groups in the applicational 
phase of the home garden project and in both the 
informational and applicational learning phases 
of the swine production project and of the pas- 
ture production project. 

The results of audio-visual aids experimental
research² in areas other than vocational
agriculture are also in conflict; that is, in certain
instances the experimental or audio-visual aids





groups were significantly superior to the control 
groups while in other instances the experimental 
groups were not significantly superior to the con- 
trol groups. 

The results obtained from limited experiment- 
al research designed to evaluate the effectiveness 
of audio-visual aids in vocational agriculture are 
not conclusive. It is important both with respect 
to economy and efficiency in education to deter- 
mine more specifically the conditions under which 
it is advisable to use audio-visual aids. Thus, in 
order to evaluate further the effectiveness of 
audio-visual aids in vocational agriculture two 
new experiments were designed. The first of 
these experiments was ‘‘An Experimental Evalu- 
ation of Certain Audio-Visual Aids in Teaching 
Soil Conservation.’’ This experiment was de- 
signed to evaluate the effectiveness of audio-vis- 
ual aids in a typical teaching and classroom situ- 
ation where students normally use reference ma- 
terials that contain many pictures and illustra- 
tions. The second experiment was ‘‘An Experi- 
mental Evaluation of the Effectiveness of Audio- 
Visual Aids in Teaching Permanent Pasture
Production.’’ This experiment was designed to test
the hypothesis that audio-visual aids are effective 
in learning situations when they provide students 
with new audio-visual experiences, that is, ex- 
periences they have not had or do not secure 
through other instructional materials included in 
the instructional unit. 

Following are reports of these two experiments 
and their implications for instruction in vocation- 
al agriculture. 


EXPERIMENT I
An Experimental Evaluation of the Effectiveness 
of Certain Audio-Visual Aids in Teaching Soil 
Conservation 


Purpose of the Experiment 





This experimental study was designed to determine
the effectiveness of certain available
audio-visual aids on learning when they were used
as part of the regular classroom instructional
materials for vocational agriculture students who
were studying soil conservation.


*All footnotes will be found at the end of this article. 
The Design of the Experiment





The Cooperating Schools—The schools which 
cooperated in the study were located in central 
Indiana and were selected on the basis of the 
following criteria: (1) school was located in 
central Indiana, (2) twenty or more freshmen 
students were enrolled in separate classes of 
vocational agriculture, and (3) projection room 
and sound movie projector were available for 
use of the vocational agriculture department. 

In addition to the above criteria, the writer 
visited the administrators and teachers of
agriculture in each of the cooperating schools in
order to gain first-hand information regarding
their willingness to cooperate in the study and 
to explain the details for carrying out their 
part of the experiment. 

The Selection of the Experimental Project— 
The unit of work taught in this study was Soil 
Conservation. This unit was selected because 
(1) it was generally included in vocational agri- 
culture courses of study in the schools of Indi- 
ana, (2) it could be organized as a relatively 
short unit (one week) of work, (3) audio-visual 
aids dealing with soil conservation were avail- 
able that were representative of the best agri- 
culture movies, and (4) soil conservation was 
being stressed by national, state and local 
school and governmental agencies. 

The Selection of the Audio-Visual Aids — The 
audio-visual aids used in this study were select- 
ed after the available soil conservation audio- 
visual aids had been carefully previewed by 
members of the Agricultural Education staff of 
Purdue University. The following factors were 
taken into consideration in the selection of the 
movies: (1) the educational level for which the 
movies were best suited, (2) subject matter con- 
tent, (3) quality of photography, (4) technical 
construction, (5) quality of sound effect, and 
(6) the physical condition of the films. 

Two soil conservation movies were used in 
this study. They were ‘‘Permanent Agriculture,’’
an International Harvester Company sound, col- 
ored movie, which was used to help students 
recognize the importance of soil conservation, 
and ‘‘Planning to Prosper,’’ an Allis Chalmers 
sound, colored movie which was used as an il- 
lustrated review of important soil conservation 
problems and those practices which are useful 
in conserving soils. 

The Reference Materials—The reference ma- 
terials used in this study were those which are 
normally available and are used by schools that 
teach vocational agriculture. 

The Independent Replications— This investi- 
gation was made in eleven different schools and 
consisted of eleven separate and independent rep- 
lications or experiments in which 236 students 
participated. Each of the eleven independent 



















replications had its own experimental group and 
control group. Although the independent replica- 
tions were complete experiments within them- 
selves, certain features were common to all. 
The common features were, (1) all students 
were taught the same unit of work on soil con- 
servation, (2) all students were administered the 
same tests, (3) all experiments were conducted 
for the same length of time, (4) the experiment- 
al group in each of the independent replications 
used the same audio-visual aids as part of their 
instruction, and (5) both the experimental and 
control groups in each independent replication 
were taught by the same instructor. 

The Experimental and Control Groups—In 
each independent replication there was an exper- 
imental or audio-visual aids group and a control 
or non-audio-visual aids group. In schools hav- 
ing two sections of the same class, one section 
was used as the experimental group and the other 
section the control group, the treatment being 
determined by lot. In schools having only one 
class section, the class was divided at random 
in order to have an experimental group and a 
control group. The experimental and control 
groups in divided class situations were taught on 
alternate days and were taught separately 
throughout the experimental period. 
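
The two chance procedures described above, drawing lots between two existing sections and dividing a single class at random, can be sketched as follows. This is only an illustration in Python; the function names, the class roster, and the use of a seeded generator are assumptions made for the example, since the source does not say how the lots were drawn.

```python
import random


def assign_sections_by_lot(section_a, section_b, seed=None):
    """Decide by lot which of two sections receives the audio-visual aids."""
    rng = random.Random(seed)
    if rng.random() < 0.5:
        return {"experimental": section_a, "control": section_b}
    return {"experimental": section_b, "control": section_a}


def divide_class_at_random(students, seed=None):
    """Split one class section into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(students)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical roster of 21 freshmen identified by number.
roster = list(range(1, 22))
experimental_group, control_group = divide_class_at_random(roster, seed=1)
plan = assign_sections_by_lot("Section A", "Section B", seed=1)
print(len(experimental_group), len(control_group), plan["experimental"])
```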

The Testing Program—The Henmon-Nelson
Test of Mental Ability and the Soil Conservation
Achievement Pre-Test were administered to all
participating students at the beginning of the
instructional period and the Soil Conservation
Achievement Post-Test was administered at the
end of the instructional period. The results of
the first two tests were used as bases for match- 
ing the experimental groups and the control 
groups, while the results of the post-test were 
used as the measure of achievement. The Soil 
Conservation Achievement Test was constructed 
by the writer, was validated on the basis of ex- 
pert opinion, and had a reliability coefficient of 
.85 as measured by the split-half method when 
corrected for length by the Spearman-Brown 
formula. 
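
The correction for length mentioned above follows the Spearman-Brown formula, which steps the correlation between two half-tests up to an estimate for the full-length test. A minimal sketch in Python follows; the function names are hypothetical, and the half-test correlation shown is back-calculated from the reported .85 rather than taken from the source.

```python
def spearman_brown(r_half, length_factor=2):
    """Step up a part-test correlation to a test `length_factor` times as long."""
    return length_factor * r_half / (1 + (length_factor - 1) * r_half)


def half_test_correlation(r_full):
    """Invert the split-half case: the half-test r that yields r_full."""
    return r_full / (2 - r_full)


# Half-test correlation implied by the reported coefficient of .85.
r_half = half_test_correlation(0.85)
r_full = spearman_brown(r_half)
print(round(r_half, 3), round(r_full, 2))
```

On this reckoning a correlation of about .74 between the two halves would correspond to the reported full-test coefficient of .85.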

The Statistical Analysis—The raw scores
obtained from the administration of the tests
already described, in the independent replications
(by schools), provided the basic data for the
statistical computations. The results of the
Henmon-Nelson Test of Mental Ability and of the
Soil Conservation Achievement Pre-Test were
used as bases for matching the experimental
groups and the control groups, while the results
of the Soil Conservation Achievement Post-Test
were used to measure achievement.

In experimental studies of this type which in- 
volve replications, the differences among the 
experimental and control groups may be due in 
part to differences among the replications and 
in part to the methods of instruction used in the 
experimental and control groups. Therefore,
the analysis of variance and covariance⁵ was
used to test the significance of all differences.
The first step was to test the significance of
interaction,⁶ that is, to determine whether the
variability among the replications was greater
than would occur by chance. The second step
was to use the appropriate formula,⁷ based upon
whether interaction was negligible or present,
to test the significance of the difference between
the pooled experimental groups and the pooled
control groups when freed from differences
which are due to interaction.


The Experimental Results and Their 
Interpretation 


The statistical analyses in this study are 
based on the results which were obtained from 
eleven independent replications which were car- 
ried out in eleven schools in central Indiana. 

The Analysis of Variance and Covariance— 
The analysis of variance and covariance was 
used first to test the significance of interaction 
or the variability among the independent repli- 
cations, and second, to test the significance of 
the differences between the achievement of the 
pooled experimental groups and the pooled con- 
trol groups when freed from interaction. In mak- 
ing these two tests of significance the results of 
the Henmon-Nelson Test of Mental Ability and 
the Soil Conservation Achievement Pre-Test 
were used as the bases for matching and the Soil 
Conservation Achievement Post-Test was used 
as the criterion of achievement. 

The results of the analysis of variance and 
covariance tests of significance are shown in 
Table I. 

As shown in Table I, interaction among the
independent replications was not significant, the
probability being between .10 and .25; that is, an
interaction as large as that observed would be
expected to occur by chance between 10 and 25
times in 100. Thus the test of significance of
the difference between the pooled experimental
or audio-visual aids groups and the pooled
control or non-audio-visual aids groups was made
assuming that interaction among the independent
replications was negligible. As shown in Table I,
the difference between the experimental and
control groups was not significant, the
probability being between .60 and .70, or a chance
difference between 60 and 70 times in 100.
This indicates that the audio-visual aids and
the non-audio-visual aids groups were not
significantly different in achievement as measured
by the Soil Conservation Achievement Post-Test
when the results of this test were adjusted
on the basis of the Henmon-Nelson Test of
Mental Ability and the Soil Conservation
Achievement Pre-Test.
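
The two-step procedure can be retraced from the sums of squares printed in Table I. The sketch below is an illustration in Python under stated assumptions: the function name `two_step_f_tests` is hypothetical, and the branch used when interaction is present (testing methods against the interaction mean square) is a common choice supplied here for completeness, not a formula quoted from the source.

```python
def two_step_f_tests(ss_inter, df_inter, ss_within, df_within,
                     ss_methods, df_methods, interaction_negligible=True):
    """Return (F for interaction, F for methods, error mean square used)."""
    ms_inter = ss_inter / df_inter
    ms_within = ss_within / df_within
    ms_methods = ss_methods / df_methods

    # Step 1: interaction tested against the within-cells mean square.
    f_interaction = ms_inter / ms_within

    # Step 2: choose the error term for the methods comparison.
    if interaction_negligible:
        # Pool interaction with within cells.
        ms_error = (ss_inter + ss_within) / (df_inter + df_within)
    else:
        # Assumed alternative: use the interaction mean square as error.
        ms_error = ms_inter
    return f_interaction, ms_methods / ms_error, ms_error


# Adjusted sums of squares and degrees of freedom from Table I.
f_inter, f_methods, ms_pooled = two_step_f_tests(
    ss_inter=732.8947, df_inter=10,
    ss_within=10916.2428, df_within=212,
    ss_methods=11.5555, df_methods=1,
)
print(round(f_inter, 3), round(ms_pooled, 4), round(f_methods, 3))
```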

Table II shows the analysis of variance for 
the unadjusted scores of the Henmon-Nelson
Test of Mental Ability. 

As shown in Table II, interaction between 
schools was not significant with a probability be- 
tween .50 and .25. There was also a non-signif- 
icant difference between the methods groups in 
mental ability with a probability between .25 and
.20. This indicates that the pooled experiment- 
al and the pooled control groups were not signif- 
icantly different in intelligence as measured by 
the Henmon-Nelson Test of Mental Ability. 

Table III shows the analysis of variance for 
the unadjusted Soil Conservation Achievement 
Pre-Test scores. 

The interaction between schools when comput- 
ed on the basis of the Soil Conservation Pre-Test 
scores was not significant, the probability being 
.50. However, there was a significant differ- 
ence between the methods groups with a probabil- 
ity between .02 and .01. This indicates that the 
pooled experimental groups were significantly 
superior to the pooled control groups at the begin- 
ning of this experimental study as measured by 
the Soil Conservation Achievement Pre-Test. 

Table IV shows the analysis of variance for 
the unadjusted Soil Conservation Achievement 
Post-Test scores. 

As shown in Table IV, interaction between
schools was not significant as indicated by a
probability between .10 and .05. The difference
between methods was also not significant, the
probability being between .10 and .05. This
indicates that the pooled experimental groups
and the pooled control groups were not
significantly different after instruction
as measured by the Soil Conservation
Achievement Post-Test.


Summary and Conclusions 





1. This study was designed to evaluate the effec- 
tiveness of certain audio-visual aids when 
they were used as part of the regular class- 
room instruction of vocational agriculture 
students who were studying Soil Conservation. 


2. This investigation was conducted in eleven
different schools and consisted of eleven
independent replications in which 236 students
participated.


3. Two sound, colored movies were used as part
of the regular classroom work of the
experimental groups in the independent replications.
They were: ‘‘Permanent Agriculture,’’ an
International Harvester Company movie, and
‘‘Planning to Prosper,’’ an Allis Chalmers
movie.


4. The reference materials used in this study
were typical of soil conservation literature in
TABLE I


ANALYSIS OF VARIANCE AND COVARIANCE OF THE SOIL CONSERVATION ACHIEVEMENT
POST-TEST SCORES WITH THE HENMON-NELSON TEST OF MENTAL ABILITY AND THE
SOIL CONSERVATION PRE-TEST SCORES HELD CONSTANT


Adjusted Analysis

                                        Sum of        Mean
Source of Variation            d.f.    Squares       Square      F      Probability   Hypothesis*
Interaction                     10      732.8947     73.2895   1.423    .25>P>.10     Accepted
Within cells                   212    10916.2428     51.4917
Interaction and within cells   222    11649.1375     52.4736
Between methods                  1       11.5555     11.5555    .220    .70>P>.60     Accepted
Methods and pooled error       223    11660.6930

*The null hypothesis

TABLE II


ANALYSIS OF VARIANCE FOR THE UNADJUSTED SCORES OF THE HENMON-NELSON
TEST OF MENTAL ABILITY

Source of                    Mean
Variation          d.f.     Square       F      Probability   Hypothesis
Within cells       214      154.011
Interaction         10      153.839    0.999    .50>P>.25     Accepted
Pooled error       224      154.003
Between methods      1      245.754    1.596    .25>P>.20     Accepted
TABLE III


ANALYSIS OF VARIANCE FOR THE UNADJUSTED SCORES OF THE SOIL CONSERVATION
ACHIEVEMENT PRE-TEST

Source of                    Mean
Variation          d.f.     Square       F      Probability   Hypothesis
Within cells                 92.409
Interaction                  52.145                           Accepted
Pooled error                 90.611
Between methods             507.468    5.601    .02>P>.01     Rejected





TABLE IV


ANALYSIS OF VARIANCE FOR THE UNADJUSTED SCORES OF THE SOIL CONSERVATION
ACHIEVEMENT POST-TEST

Source of                    Mean
Variation          d.f.     Square       F      Probability   Hypothesis
Within cells
Interaction                 126.680             .10>P>.05     Accepted
Pooled error                 78.626
Between methods             230.847    2.936    .10>P>.05     Accepted
that they contained a wealth of pictures and
illustrations. These pictures and illustra- 
tions showed both the causes and effects of 
soil erosion together with soil conservation 
practices which are used to control soil ero- 
sion. 


5. The Henmon-Nelson Test of Mental Ability
and the Soil Conservation Achievement Pre-Test
were administered to all students before
instruction was begun and the results of these
tests were used as the bases for matching.


6. The Soil Conservation Achievement Post-Test
was administered to all students after
instruction was completed and the results of
this test were used to measure achievement.


7. The data obtained in this study were analyzed
statistically by the analysis of variance
and covariance.


8. The experimental or audio-visual aids group
was not significantly superior to the control
or non-audio-visual aids group in achievement
as measured by the Soil Conservation
Post-Test when the results of this test were
adjusted on the basis of the Henmon-Nelson
Test of Mental Ability and the Soil Conservation
Achievement Pre-Test.


EXPERIMENT II 


An Experimental Evaluation of the Effectiveness
of Certain Audio-Visual Aids in Teaching 
Permanent Pasture Production 


Introduction 


After it was found that the audio-visual aids
used in Experiment I did not make a significant
difference in the achievement of students of
vocational agriculture who were studying soil
conservation, Experiment II was undertaken. This
experimental study was designed to test the
hypothesis that audio-visual aids are significantly
effective in teaching situations when they
provide students with new audio-visual experiences
they have not had or do not secure through other
instructional materials included in the
instructional unit.


Purpose of the Study 





This experimental study was designed to 
determine the effectiveness of certain audio- 
visual aids on learning when they were especial- 
ly selected and arranged to give students in reg- 
ularly organized classes of vocational agricul- 
ture new audio-visual experiences pertaining to 





the improvement of permanent pastures. 


The Design of the Experiment 





This experimental study was designed to be
carried out under normal school and classroom
conditions and yet have complete and independent
replications of the experiment in each of
several cooperating schools. The replications in
the cooperating schools were made possible by
using comparable classes in vocational
agriculture in two successive years.

The Cooperating Schools—The schools which 
cooperated in this study were distributed 
throughout the state of Indiana and were select- 
ed upon the basis of the following criteria: 
(1) separate sophomore classes in agri- 
culture, (2) necessary projection equipment avail- 
able for use of the vocational agriculture depart- 
ment, (3) permanent pasture unit taught in sopho- 
more agriculture class, (4) ten or more students 
enrolled in the sophomore agriculture class. 

In addition to the application of the above cri- 
teria, the writer visited each of the cooperating 
schools in order to gain first-hand information 
regarding the school’s willingness to cooperate 
in the study and to discuss with them the details 
for carrying out their part of the study. 

The Instructional Unit—Since the purpose of 
this study was to determine the effectiveness of 
audio-visual aids when used primarily to give 
students new visual experiences, it seemed de- 
sirable to select an instructional unit dealing 
with a phase of agriculture which is not normal- 
ly given major attention on the typical farm. 
Permanent pastures seemed to be a phase of ag- 
riculture that fitted this description; thus it was 
decided to base the instruction in this experi- 
mental study on The Improvement of Permanent 
Pastures. 

The Audio-Visual Aids—The audio-visual
aids used in this study were obtained from the
2 x 2 inch Kodachrome slide files of the
Agronomy Department of Purdue University. The
slides used were selected and arranged especially
for this study, and the selection and arrangement
were based upon the objectives which were
formulated as a guide for teaching the unit on
‘‘The Improvement of Permanent Pastures.’’
The four sets of slides used in the study
were: Set A, 11 slides, ‘‘Good and Poor
Pastures,’’ was designed to help students recognize
and realize the wide differences between good
and poor pastures; Set B, 14 slides, ‘‘Important
Pasture Plants,’’ was designed to help students
become acquainted with the characteristics of
the various pasture plants; Set C, 20 slides,
‘‘Liming and Fertilizing Permanent Pastures,’’
showed the results obtained from various soil
treatments in pastures and in experimental
pasture plots; Set D, 21 slides, ‘‘Renovation of
Permanent Pastures,’’ showed various ways
and methods of renovating permanent pastures
and the results of renovation versus no
renovation.

The Reference Materials—The reference ma- 
terials used were those which are normally 
available to schools that teach vocational
agriculture. In addition to the pasture reference
materials which were available in the libraries 
of the cooperating schools, each school was 
supplied with permanent pasture references in 
sufficient quantity for all students. 

The Independent Replications (by schools)— 
This investigation was conducted in twenty dif- 
ferent schools and consisted of twenty independ- 
ent replications or experiments, each having its 
own experimental or audio-visual aids group 
and control or non-audio-visual aids group. Al- 
though the independent replications were com- 
plete within themselves, there were certain fea- 
tures common to all. The common features 
were (1) all students were members of regular- 
ly organized sophomore classes in vocational 
agriculture, (2) all students studied the same 
unit of work, (3) all students were administered 
the same tests, (4) all students in each
independent replication had access to the reference
materials which were supplied to the cooperat- 
ing schools, (5) both the experimental and con- 
trol groups in each independent replication were 
taught by the same instructor, (6) all students 
in each independent experiment had the same 
amount of time for instruction, (7) all students 
in the experimental groups of the independent 
replications had the same audio-visual aids
substituted for part of their instruction, and (8) the
experimental and control phases in each of the 
independent replications were taught in succes- 
sive years, the order being determined at ran- 
dom. 

The Testing Program — The Henmon-Nelson 
Test of Mental Ability was administered to all 
students who participated in the experimental 
study and the results of this test were used as 
one of the bases for matching. Achievement 
was measured by a Pasture Production Achieve- 
ment test. This test was constructed by the 
writer and was administered at the beginning of 
the independent experiments as a pre-test and 
at the end as a post-test. The pre-test results 
were used as the other basis for matching while 
the post-test results were used as the measure 
of achievement. The Pasture Production 
Achievement Test was validated on the basis of 
expert opinion and had a reliability coefficient 
of .79 as measured by the Hoyt method¹⁰ of
computing test reliability. 
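
The source does not reproduce the reliability formula, so the following is only a sketch of the analysis-of-variance approach usually associated with Hoyt's name: item scores are arranged in a students-by-items table, and reliability is estimated as one minus the ratio of the residual mean square to the mean square among students. The function name `hoyt_reliability` and the tiny score table are invented for illustration.

```python
def hoyt_reliability(scores):
    """Estimate reliability as 1 - MS_residual / MS_persons.

    `scores` is a list of rows, one per student, each row holding that
    student's score on every item (for example 1 = right, 0 = wrong).
    """
    n_persons = len(scores)
    n_items = len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n_persons * n_items)

    person_means = [sum(row) / n_items for row in scores]
    item_means = [sum(row[j] for row in scores) / n_persons
                  for j in range(n_items)]

    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_persons = n_items * sum((m - grand_mean) ** 2 for m in person_means)
    ss_items = n_persons * sum((m - grand_mean) ** 2 for m in item_means)
    ss_residual = ss_total - ss_persons - ss_items

    ms_persons = ss_persons / (n_persons - 1)
    ms_residual = ss_residual / ((n_persons - 1) * (n_items - 1))
    return 1 - ms_residual / ms_persons


# Hypothetical table: five students, four items.
example_scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
reliability = hoyt_reliability(example_scores)
print(round(reliability, 2))
```

A coefficient computed this way uses every item rather than a single split into halves, which may be why it was preferred for the pasture test.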

All tests used in this study were scored ob- 
jectively on the basis of keys which were pre- 
pared before the testing was begun. The scor- 
ing was done by or under the supervision of the 
writer.

The Statistical Analyses—The raw scores
obtained from the administration of the tests
already described, in the independent replications
(by schools), provided the basic data for the
statistical computations. The results of the
Henmon-Nelson Test of Mental Ability and of the
Pasture Production Achievement Pre-Test were used
as bases for matching the experimental groups and
control groups, while the results of the Pasture
Production Achievement Post-Test were used
to measure achievement.

In experimental studies of this type which
involve replications, the differences among the
experimental and control groups may be due in
part to differences among the replications and
in part to the methods of instruction used in the
experimental and control groups. Therefore,
the analysis of variance and covariance was used
to test the significance of all differences. The
first step was to test the significance of
interaction,¹¹ that is, to determine whether the
variability among the replications was greater than
would occur by chance. The second step was to
use the appropriate formula,¹² based upon
whether interaction was negligible or present, to
test the significance of the difference between
the pooled experimental groups and the pooled
control groups when freed from differences
which are due to interaction.





The Experimental Results and Their Inter- 
pretation 


The statistical analyses in this study are 
based on the results which were obtained from 
sixteen of the twenty independent replications in 
which all phases of the study were completed sat- 
isfactorily. Four hundred and twenty-one stu- 
dents participated in the completed replications. 
Teachers changing positions in two schools and 
unusable tests returned from two other schools 
made it impossible for four of the independent 
replications to complete their part of the study. 
Of the sixteen independent replications complet- 
ed satisfactorily, eight conducted the experi- 
mental phase of the study the first year of the 
study and eight conducted the control phase. This 
order was reversed the second year of the study. 

The Analysis of Variance and Covariance. 
The analysis of variance and covariance was used 
first to test the significance of interaction, or 
the variability among the independent replica- 
tions, and second, to test the significance of the 
difference between the achievement of the pooled ex- 
perimental or audio-visual aids group and the 
pooled control or non-audio-visual aids group 
when freed from variability among the replica- 
tions. In making these two tests of significance 
the results of the Henmon-Nelson Test of Ment- 
al Ability and the Pasture Production Achieve- 
ment Pre-Test were used as the bases for 
matching the experimental and control groups 
while the Pasture Production Achievement Post- 
Test was used as the criterion of achievement. 

The results of the variance and covariance 
tests of significance are shown in Table IIA. 

As will be seen in Table IIA, the existence of 
real interaction among the independent 
replications remains in doubt: the probability 
associated with the test of interaction is .05, 
or 5 chances in 100. Thus, the test of significance of the 
difference between the pooled experimental or 
audio-visual aids group and the pooled control 
or non-audio-visual aids group was made assum- 
ing that real interaction existed among the inde- 
pendent replications. As shown in Table IIA, the 
difference between the experimental and control 
groups has a probability between .025 and .01; that is, 
a difference this large would arise by chance 
between 2.5 and 1 times in 100. 
This indicates that the audio-visual aids group 
was on the average significantly superior to the 
non-audio-visual aids group in achievement as 
measured by the Pasture Production Achieve- 
ment Post-Test when the results of this test 
were adjusted on the basis of the Henmon- 
Nelson Test of Mental Ability and the Pasture 
Production Achievement Pre-Test. 


Summary and Conclusions 


1. This experimental study was designed to eval- 
uate the effectiveness of audio-visual aids 
which were especially selected to give soph- 
omore students, in regularly organized 
classes of vocational agriculture, new visual 
experiences pertaining to the improvement of 
permanent pastures. 


2. This investigation was conducted in twenty 
different schools and consisted of twenty in- 
dependent replications. Sixteen of the twenty 
schools and 421 students satisfactorily com- 
pleted all phases of the study. 


3. Four sets of Kodachrome slides were used 
as part of the regular classroom instruction 
of the experimental group in the independent 
replications. The slide sets were entitled 
(a) "Good and Poor Pastures," (b) "Import- 
ant Pasture Plants," (c) "Liming and Fer- 
tilizing Permanent Pastures," and (d) "Ren- 
ovation of Permanent Pastures." 


4. The reference materials used in this study 
consisted primarily of the bulletins, circu- 
lars, and mimeograph materials which were 
furnished to the cooperating schools. The 
bulletins and circulars contained very 
few pictures and illustrations which would 
give the typical farm boy new visual experi- 
ences, while the mimeograph materials con- 
tained none at all. 


5. The Henmon-Nelson Test of Mental Ability 
and the Pasture Production Achievement Pre- 
Test were administered to all students before 
instruction was begun and the results of these 
tests were used as the bases for matching. 


6. The Pasture Production Achievement Post- 
Test was administered to all students after 
instruction was completed and the results of 
this test were used to measure achievement. 


7. The data obtained in this study were analyzed 
statistically by analysis of variance and co- 
variance. 


8. The experimental or audio-visual aids group 
was significantly superior to the control or 
non-audio-visual aids group as measured by 
the Pasture Production Achievement Post- 
Test when the results of the test were adjust- 
ed on the basis of the results of the Henmon- 
Nelson Test of Mental Ability and the Pasture 
Production Achievement Pre-Test. 


Summary, Generalizations and Implications 





Summary. The two experimental studies re- 
ported in this manuscript were designed to eval- 
uate the effectiveness of audio-visual aids used 
under different conditions. Experiment I in- 
volved the study of soil conservation while Exper- 
iment II involved the study of permanent pas- 
tures. 

The first experiment was designed to evalu- 
ate the effectiveness of audio-visual aids in 
classroom situations where students used refer- 
ence materials that contained many pictures and 
illustrations. The results of this experiment 
showed that the experimental or audio-visual 
aids group was not significantly superior to the 
control group in achievement. 

The second experiment was designed to eval- 
uate the effectiveness of audio-visual aids in 
classroom situations when the audio-visual aids 
were especially selected to give students new 
visual experiences, that is, visual experiences 
they had not had or did not secure through other 
instructional materials. The results of this ex- 
periment showed that the experimental or audio- 
visual aids group was significantly superior to 
the control group in achievement. 

Generalizations. On the basis of these exper- 
iments the following generalizations may be 
made regarding the effectiveness of audio-visu- 
al aids used in the schools which participated in 
the respective experiments. It seems reason- 
able to assume that these generalizations will 
apply to other schools when students are in- 
structed under similar conditions. 


1. The use of audio-visual aids in teaching voca- 
tional agriculture may or may not result in 
greater student achievement. 


2. The results of these experiments indicate 
that audio-visual aids are ineffective in learn- 
ing situations when students use reference 
materials that contain pictures and illustra- 
tions which are similar to those contained in 
the special audio-visual aids. It may be re- 
called that in Experiment I the experimental 
group was not superior to the control group 
and that this group of students used reference 
materials which contained many pictures and 
illustrations. 


3. The results of these experiments indicate 
that audio-visual aids are effective in learn- 
ing situations when they enable students to ac- 
quire related visual experiences they have not 
had or do not secure through reference ma- 
terials. It may be recalled that in Exper- 
iment II a significant difference was found in 
favor of the experimental group which had the 
opportunity to acquire new visual experiences. 


4. Student achievement may be retarded when 
audio-visual aids merely repeat those visual 
experiences which students have the oppor- 
tunity to secure through other instructional 
materials. For example, in Experiment I 
the control or non-audio-visual aids group 
made a greater average gain than did the ex- 
perimental group even though the experiment- 
al group had a slight advantage in intelligence 
and was significantly superior to the control 
group on the basis of pre-test results. The 
reason may have been that the control group 
had an advantage of being able to study new 
materials while the experimental group was 
viewing visual materials which were similar 
to those found in the reference materials. 
When time is consumed in providing visual 
experiences the student does not need, there 
may be a reduction in the efficiency of learn- 
ing. 


Implications. It seems clear that learning is 
not always enhanced by the use of audio-visual 
aids. 

Moreover, teachers cannot assume that aud- 
io-visual aids which have been effective with one 
group of students will be equally effective with 
other groups of students. 

Audio-visual aids should be selected and 
used in light of the previous experience of the 
learner and the nature of other instructional ma- 
terials that are to be used. 

Finally, each teacher should give careful con- 
sideration to the visual experiences that each 
group of his students has already had and those 
that may be obtained through the use of other in- 
structional materials before making a decision 
regarding the selection and use of audio-visual 
aids in respect to any unit of work in the course 
of study. 











ANALYSIS-OF-VARIANCE MODELS AND THEIR 
USE IN A THREE-WAY DESIGN 
WITHOUT REPLICATION 


DONALD M. MEDLEY, HAROLD E. MITZEL, 
ARTHUR N. DOI 
Municipal Colleges of New York City 


THE TECHNIQUE of analysis of variance has 
enjoyed rapidly increasing popularity in the treat- 
ment of data collected in educational experiments 
and surveys. Although the interpretation of the 
results of such analyses is closely dependent on 
the assumptions made about the model underlying 
the use of the technique, most textbooks have 
offered little or no guidance about the choice of 
models or their relationship to the conclusions 
to be drawn. As a result, some of the potential- 
ities of the analysis of variance are not being 
realized, and the technique has on occasion been 
misused. It is the purpose of this paper to at- 
tempt to state and illustrate as simply as possi- 
ble the implications of the use of each of three 
elementary analysis-of-variance models, using 
data gathered in a series of observations of 
classroom teachers. 

In the belief that the verbal behavior of a 
teacher is a valid measure of the social- 
emotional climate of her classroom, Withall (9) 
developed a technique for classifying statements 
made by a teacher during a typical class period 
into seven categories. Using this technique, 
Mitzel and Rabinowitz (6) made two series of four 
visits each to each of four classroom teachers 
at intervals of about one week and made tallies 
of each teacher's remarks independently until 
approximately 100 had been reported. The pro- 
portion of statements which fell into the learner- 
centered category, when transformed to an an- 
gle by the arc sine transformation (4), is known 
as the Climate Index. The 32 values of this In- 
dex obtained by the two observers on four visits 
to the four teachers are shown in Table I. These 
are the data that will be used for illustrative pur- 
poses in the discussion to follow. 
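
As a rough illustration of how a Climate Index value in degrees might be obtained from a tally, the sketch below assumes the common form of the arc sine transformation, angle = arcsin(sqrt(p)). The source cites reference (4) for the transformation without stating the formula, so this form, the function name, and the example tallies are assumptions.

```python
import math


def climate_index(learner_centered, total_statements):
    """Convert a tally to an angle in degrees.

    Assumes the arc sine transformation angle = arcsin(sqrt(p)), where p is the
    proportion of statements falling in the learner-centered category.
    """
    p = learner_centered / total_statements
    return math.degrees(math.asin(math.sqrt(p)))


# Hypothetical tallies: 50 and 25 learner-centered statements out of 100.
index_half = climate_index(50, 100)      # about 45 degrees
index_quarter = climate_index(25, 100)   # about 30 degrees
```

Under this assumed form, the scores of roughly 30 to 53 degrees in Table I would correspond to proportions of about one quarter to two thirds.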


Model I Analysis 





Let us suppose that the purpose of our exper- 
iment is to discover whether or not these two ob- 
servers, 1 and 2, did detect differences among 
these four teachers' behaviors on the four partic- 
ular occasions on which they were visited with 
respect to the dimension measured by Withall's 
Climate Index. Our interest is in differences 
that cannot be attributed to chance. The fact that 
one of the teachers (or one of the observers) 
might be ill or fatigued during a particular visit 
could give rise to a "chance" difference in this 
sense. 

Let us write $X_{ijk}$, the score that teacher $i$ 
gets on visit $j$ from observer $k$, as follows: 


$$X_{ijk} = M + T_i + V_j + O_k + A_{ij} + B_{ik} + C_{jk} + E_{ijk}$$ 


with terms on the right defined as follows: 


$M$ is a constant for all scores. 
$T_i$ is a fixed constant for teacher $i$. 
$V_j$ is a fixed constant for visit $j$. 
$O_k$ is a fixed constant for observer $k$. 


The three terms $T_i$, $V_j$, and $O_k$ are usually re- 
ferred to as "main effects." There will be four 
values of $T_i$, namely $T_1$, $T_2$, $T_3$, and $T_4$, cor- 
responding to the four teachers; four val- 
ues of $V_j$, corresponding to the four days on which 
all four teachers were visited; and two values of 
$O_k$, corresponding to the two observers. 
$T_i$ may be referred to as the "teacher" effect, 
$V_j$ as the "visit" effect, and $O_k$ as the "observ- 
er" effect. Also: 


$A_{ij}$ is a constant for teacher $i$ on visit $j$. 
$B_{ik}$ is a constant for teacher $i$ and observer $k$. 
$C_{jk}$ is a constant for observer $k$ on visit $j$. 


These three terms are known as "interaction 
effects": $A_{ij}$ is the "teacher-visit" interaction, 
$B_{ik}$ the "teacher-observer" interaction, and $C_{jk}$ 
the "observer-visit" interaction. $A_{ij}$ takes on 
16 different values: $A_{11}$, $A_{12}$, $A_{13}$, $A_{14}$, $A_{21}$, 
etc., up to $A_{44}$; $B_{ik}$ takes on 8 values, and $C_{jk}$ 
takes on 8 values. 

Thus we have defined 43 parameters of which 
10 are main effects, 32 are interaction effects, 
and one is a general mean. The hypotheses we 
shall test and the conclusions we shall draw will 
apply only to these parameters. We will apply 
these conclusions to the behavior of the four 
teachers only as a result of an assumption that 
the Climate Index scores behave as though they 
were made up of these 43 additive parameters. 

$E_{ijk}$, or the residual, is defined as what- 
ever is left in an individual observation $X_{ijk}$ 
when the seven parameters which enter into it 

TABLE I 


CLIMATE INDEX SCORES (IN DEGREES) ASSIGNED BY TWO OBSERVERS TO THE 
VERBAL BEHAVIOR OF FOUR TEACHERS ON FOUR VISITS 





Teacher Observer ll 





] (1) ¢ 49 
(2) ‘ 53 


(1) 3% 46 
(2) ¢ 48 37 


(1) 37 37 
(2) 35 38 30 


(1) 33 50 40 
(2) 38 47 40 


TABLE II 


ANALYSIS OF VARIANCE OF CLIMATE INDEX SCORES 
UNDER MODEL I ASSUMPTIONS 


Source of             Degrees of    Expected                                  Observed 
Variation             Freedom       Mean Square                               Mean Square 

Teacher Effect            3         $\sigma^2 + \frac{1}{3}\sum T_i^2$        348.792 
Visit Effect              3         $\sigma^2 + \frac{1}{3}\sum V_j^2$         22.042 
Observer Effect           1         $\sigma^2 + \sum O_k^2$                     3.125 
Teacher-Visit 
  Interaction             9         $\sigma^2 + \frac{1}{9}\sum A_{ij}^2$     127.819 
Teacher-Observer 
  Interaction             3         $\sigma^2 + \frac{1}{3}\sum B_{ik}^2$       4.792 
Visit-Observer 
  Interaction             3         $\sigma^2 + \frac{1}{3}\sum C_{jk}^2$       5.375 
Residual                  9         $\sigma^2$                                  4.597 
Total                    31 


are subtracted. For example, $E_{111}$ is defined 
as: 


$$E_{111} = X_{111} - M - T_1 - V_1 - O_1 - A_{11} - B_{11} - C_{11}$$ 


$E_{ijk}$ is not a parameter; we realize that if the 
experiment could be repeated using the same 
observers, the same teachers, and making the 
visits on the same occasions, the 43 parameters 
would be the same, but the 32 values of $E_{ijk}$ 
would change. 

We assume that, if the experiment were re- 
peated many times, the values of $X_{ijk}$ would be 
normally distributed with mean $M + T_i + V_j + O_k + 
A_{ij} + B_{ik} + C_{jk}$ and variance $\sigma^2$. 

$M$ may be so chosen that the mean of any set 
of parameters (except $M$) is zero; that is, the 
mean of the four $T_i$ is zero, as also are the 
means of the $V_j$, $O_k$, $A_{ij}$, $B_{ik}$, and $C_{jk}$. 

Now it is possible to state the hypotheses we 
wish to test and to test them in terms of Model 
I. The principal hypothesis we are interested 
in may be written: 


$$H_0: T_1 = T_2 = T_3 = T_4 = 0$$ 


We hypothesize that the four teacher effects are 
equal; and, since their mean is zero, that they 
are all equal to zero. $H_0$ states, then, that 
there is no real teacher effect at all. 

We will not go into the derivation of expected 
mean squares here (see reference 8), but will 
merely present them in Table II. This table 
shows an analysis of variance with the ‘‘expect- 
ed’’ value of the mean square for each effect. 
The ‘‘expected’’ mean square is the mean value 
of all of the mean squares that would be obtained 
if the experiment could be repeated an infinite 
number of times by the same observers visiting 
the same teachers on the same occasions. 
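
The observed mean squares in an analysis of this kind come from splitting each score into the general mean, the three main effects, the three two-way interactions, and a residual. The following is a minimal sketch of that decomposition for a teachers by visits by observers layout with one score per cell. The function names and the `demo` scores are hypothetical; the actual Table I scores are not reproduced here.

```python
from itertools import product


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def three_way_anova(x):
    """Return {source: (sum_of_squares, df, mean_square)} for scores x[i][j][k].

    i indexes teachers, j visits, k observers; one observation per cell.
    """
    n_t, n_v, n_o = len(x), len(x[0]), len(x[0][0])
    t_idx, v_idx, o_idx = range(n_t), range(n_v), range(n_o)
    cells = list(product(t_idx, v_idx, o_idx))
    m = mean(x[i][j][k] for i, j, k in cells)

    # Main effects: marginal means as deviations from the general mean.
    T = [mean(x[i][j][k] for j in v_idx for k in o_idx) - m for i in t_idx]
    V = [mean(x[i][j][k] for i in t_idx for k in o_idx) - m for j in v_idx]
    O = [mean(x[i][j][k] for i in t_idx for j in v_idx) - m for k in o_idx]

    # Interactions: two-way means minus the general mean and both main effects.
    A = [[mean(x[i][j][k] for k in o_idx) - m - T[i] - V[j] for j in v_idx] for i in t_idx]
    B = [[mean(x[i][j][k] for j in v_idx) - m - T[i] - O[k] for k in o_idx] for i in t_idx]
    C = [[mean(x[i][j][k] for i in t_idx) - m - V[j] - O[k] for k in o_idx] for j in v_idx]

    ss = {
        "teacher": n_v * n_o * sum(t * t for t in T),
        "visit": n_t * n_o * sum(v * v for v in V),
        "observer": n_t * n_v * sum(o * o for o in O),
        "teacher-visit": n_o * sum(a * a for row in A for a in row),
        "teacher-observer": n_v * sum(b * b for row in B for b in row),
        "visit-observer": n_t * sum(c * c for row in C for c in row),
        # Residual: whatever is left after all seven components are subtracted.
        "residual": sum(
            (x[i][j][k] - m - T[i] - V[j] - O[k] - A[i][j] - B[i][k] - C[j][k]) ** 2
            for i, j, k in cells
        ),
    }
    df = {
        "teacher": n_t - 1,
        "visit": n_v - 1,
        "observer": n_o - 1,
        "teacher-visit": (n_t - 1) * (n_v - 1),
        "teacher-observer": (n_t - 1) * (n_o - 1),
        "visit-observer": (n_v - 1) * (n_o - 1),
        "residual": (n_t - 1) * (n_v - 1) * (n_o - 1),
    }
    return {name: (ss[name], df[name], ss[name] / df[name]) for name in ss}


# Hypothetical 4 x 4 x 2 scores (four teachers, four visits, two observers).
demo = [[[30 + 5 * i + 2 * ((i * j) % 3) + k * ((i + j) % 2) for k in range(2)]
         for j in range(4)] for i in range(4)]
demo_table = three_way_anova(demo)
```

For a 4 x 4 x 2 layout the degrees of freedom come out as 3, 3, 1, 9, 3, 3, and 9, which add to 31.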

Note the expected mean squares for the teach- 
er effect and for the "residual." It is apparent 
that if $H_0$ is true, and therefore all the $T_i$'s 
are zero, then the second part of the expected 
mean square for teachers vanishes, and the 
teacher effect has the same expected mean 
square as the residual, viz., $\sigma^2$. 

The ratio of two independent estimates of the 
same variance is distributed according to Snede- 
cor's F distribution with $n_1$ and $n_2$ equal to the 
respective numbers of degrees of freedom on 
which the two estimates are based. If $H_0$ is ten- 
able, then we can tell how large a value of F we 
should expect to arise by chance, from an exam- 
ination of the table of the F distribution. Accord- 
ing to this table, F will exceed 3.86 only 5 per- 
cent of the time, and 6.99 only 1 percent of the 
time if $H_0$ is true. 

Table II also shows the values of these mean 
squares observed in our experiment. The "teach- 
er" mean square actually obtained was 348.792, 
and the residual mean square was 4.597, so that 
the F ratio is 75.87. The probability that an F- 
value so large or larger would occur (if $H_0$ is 
true) is less than .01, so we reject $H_0$ and con- 
clude that $T_1 = T_2 = T_3 = T_4 = 0$ is not a tenable 
hypothesis. There is evidence, then, that dif- 
ferences among the Climate Index scores of 
these four teachers on these four visits were 
detected by these two observers. 
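
The arithmetic of this test can be checked in a few lines. The sketch below uses only the mean squares and the two critical values quoted in the text (3.86 and 6.99, which apply to 3 and 9 degrees of freedom); the function names and the wording of the three verdicts are illustrative choices, not the authors'.

```python
F_05, F_01 = 3.86, 6.99  # critical values quoted in the text (3 and 9 d.f.)


def f_ratio(ms_effect, ms_error):
    """Ratio of an effect mean square to the error mean square."""
    return ms_effect / ms_error


def verdict(f, f_05=F_05, f_01=F_01):
    """Classify an observed F against the 5 percent and 1 percent points."""
    if f > f_01:
        return "reject at the .01 level"
    if f > f_05:
        return "in doubt (between .05 and .01)"
    return "accept"


# Model I: each effect is tested against the residual mean square, 4.597.
f_teacher = f_ratio(348.792, 4.597)
f_visit = f_ratio(22.042, 4.597)

print(round(f_teacher, 2), verdict(f_teacher))
print(round(f_visit, 2), verdict(f_visit))
```

The visit effect also has 3 degrees of freedom, so the same two critical values apply to it; effects with other degrees of freedom need their own table entries.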

Similar hypotheses may be set up regarding 
the $V_j$, $O_k$, $A_{ij}$, $B_{ik}$, and $C_{jk}$. For the data in 
Table II, we reject the hypothesis that the $A_{ij}$ 
(teacher-visit interactions) are all equal to zero 
and retain some doubts about the $V_j$ (visit main 
effects) since the probability associated with the 
observed F under the hypothesis is between .01 
and .05. We accept the hypotheses that the $O_k$, 
$B_{ik}$, and $C_{jk}$ are all zero. 

If we now discard the parameters not shown 
to be significantly different from zero, we may 
rewrite our equation as follows: 


$$X_{ijk} = M + T_i + V_j + A_{ij} + E_{ijk} .$$ 


Pooling the sums of squares and the degrees 
of freedom corresponding to the mean squares 
which estimate the same variance simplifies the 
analysis of variance to the form shown in Table 
III.* In other words, since the mean squares for 
observers, for teacher-observer interaction, 
and for visit-observer interaction are now all 
seen to estimate the same variance $\sigma^2$, we add 
the degrees of freedom and the sums of the 
squares for these four sources of variation to- 
gether, and obtain a new estimate of error based 
on 16 degrees of freedom rather than on 9. This 
new estimate of $\sigma^2$ as 4.687 is a more precise 
estimate than the previous one of 4.597, since 
it is based on a larger number of degrees of 
freedom; that is, it is probably closer to the true 
value. 
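
The pooled estimate can be reproduced from the mean squares given in the text. This sketch recovers each sum of squares as mean square times degrees of freedom; the dictionary layout and variable names are illustrative, and the degrees of freedom are those of a 4 x 4 x 2 design with one observation per cell.

```python
# (observed mean square, degrees of freedom) for the four pooled sources.
pooled_sources = {
    "observer": (3.125, 1),
    "teacher-observer": (4.792, 3),
    "visit-observer": (5.375, 3),
    "residual": (4.597, 9),
}

# Sum of squares = mean square x degrees of freedom; pool both before dividing.
pooled_ss = sum(ms * df for ms, df in pooled_sources.values())
pooled_df = sum(df for _, df in pooled_sources.values())
pooled_ms = pooled_ss / pooled_df

# Retest of the visit effect against the pooled error term.
f_visit_pooled = 22.042 / pooled_ms

print(pooled_df, round(pooled_ms, 3), round(f_visit_pooled, 2))
```

Note that the mean squares themselves are not averaged; it is the sums of squares and the degrees of freedom that are added.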

We can now retest our hypothesis about the $V_j$; 
we find it remains in the "region of doubt," i.e., 
between the .01 and .05 significance levels. 

The design in Table II was a three-way design 
with one observation per cell; our analysis has 
revealed that the simpler design in Table III, 
a two-way analysis with two observations per 
cell, may be substituted. 


Model II Analysis 


We have seen how an appropriate model may 
be set up for the analysis of the data in Table I 





*Editor's Note: There is at present no uniform agreement relative to the problem of preliminary tests of 
significance and subsequent pooling of insignificant effects. This is an important theoretical and 
practical problem. 





JOURNAL OF EXPERIMENTAL EDUCATION (Vol. 24 


TABLE III 


CONDENSED MODEL I ANALYSIS OF VARIANCE 


Source of             Degrees of    Expected                                  Observed 
Variation             Freedom       Mean Square                               Mean Square 

Teacher Effect            3         $\sigma^2 + \frac{1}{3}\sum T_i^2$        348.792 
Visit Effect              3         $\sigma^2 + \frac{1}{3}\sum V_j^2$         22.042 
Interaction               9         $\sigma^2 + \frac{1}{9}\sum A_{ij}^2$     127.819 
Residual (Pooled)        16         $\sigma^2$                                  4.687 
Total                    31 





TABLE IV 


ANALYSIS OF VARIANCE OF CLIMATE INDEX SCORES 
UNDER MODEL II ASSUMPTIONS 


Source of             Degrees of    Expected                                                              Observed 
Variation             Freedom       Mean Square                                                           Mean Square 

Teacher Effect            3         $\sigma_e^2 + 2\sigma_{tv}^2 + 4\sigma_{to}^2 + 8\sigma_t^2$          348.792 
Visit Effect              3         $\sigma_e^2 + 2\sigma_{tv}^2 + 4\sigma_{vo}^2 + 8\sigma_v^2$           22.042 
Observer Effect           1         $\sigma_e^2 + 4\sigma_{to}^2 + 4\sigma_{vo}^2 + 16\sigma_o^2$           3.125 
Teacher-Visit 
  Interaction             9         $\sigma_e^2 + 2\sigma_{tv}^2$                                         127.819 
Teacher-Observer 
  Interaction             3         $\sigma_e^2 + 4\sigma_{to}^2$                                           4.792 
Visit-Observer 
  Interaction             3         $\sigma_e^2 + 4\sigma_{vo}^2$                                           5.375 
Residual                  9         $\sigma_e^2$                                                            4.597 


when the problem of interest is whether or not 
real differences among the four teachers on the 
four visits were detected by the two observers. 

Suppose, however, that we are interested 
not only in these four teachers but in other sim- 
ilar teachers, and not only in the four particular 
occasions on which they were visited, but any 
occasion on which they might have been visited, 
and not only in these two observers but in other 
similar observers. Our results are meant to 
be generalized to populations of teachers, visits, 
and observers. The model appropriate to this 
case is Model II. 

In postulating Model II, we construct a ‘‘hy- 
per-population’’ generated by an infinite popula- 
tion of observers visiting an infinite population 
of teachers on an infinite population of occasions 
under all of the various sets of conditions which 
might obtain. Paradoxically, we shall see that 
only nine parameters are needed to specify this 
"hyper-population" completely. 

We begin the specification of Model II by writ- 
ing the score $X_{ijk}$ given to teacher $i$ on visit $j$ 
by observer $k$ as follows: 


$$X_{ijk} = m + t_i + v_j + o_k + a_{ij} + b_{ik} + c_{jk} + e_{ijk}$$ 

where 


$m$ is a constant for all values of $X_{ijk}$; 

$t_i$ is a constant for all scores of teacher $i$ 
over all observers and all occasions;* 

$v_j$ is a constant for all teachers and observ- 
ers on occasion $j$; 

$o_k$ is a constant for observer $k$ over all teach- 
ers and all visits. 


Also 


$a_{ij}$ is constant over all observers of teacher 
$i$ on visit $j$; 

$b_{ik}$ is constant over all visits of observer $k$ 
to teacher $i$; 

$c_{jk}$ is constant over all teachers on a single 
visit $j$ by a single observer $k$. 


Finally 

$$e_{ijk} = X_{ijk} - m - t_i - v_j - o_k - a_{ij} - b_{ik} - c_{jk}$$ 


As before, $e_{ijk}$ is the part of $X_{ijk}$ that is not ac- 
counted for by any of the seven components de- 
fined above. 
If we rewrite this equation as follows: 


$$X_{ijk} - m = t_i + v_j + o_k + a_{ij} + b_{ik} + c_{jk} + e_{ijk},$$ 


square both sides, and take the means over all 
possible values of $X_{ijk}$ in the population, we may 
write 


$$\sigma_x^2 = \sigma_t^2 + \sigma_v^2 + \sigma_o^2 + \sigma_{tv}^2 + \sigma_{to}^2 + \sigma_{vo}^2 + \sigma_e^2 + \text{covariance terms},$$ 

where 


$\sigma_x^2$ is the total variance of the $X_{ijk}$ about their 
mean $m$, 

$\sigma_t^2$ is the variance of the $t_i$ about their mean, 
zero, and the remaining terms are expressions 
of the variances of the various effects, each 
about a mean of zero. 
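
The squaring step can be written out term by term. The sketch below follows the symbols of the decomposition above; the shorthand $u_1, \dots, u_7$ for the seven components is introduced here only for compactness and is not the authors' notation.

```latex
\begin{align*}
X_{ijk} - m &= \sum_{r=1}^{7} u_r ,
  \qquad (u_1,\dots,u_7) = (t_i,\ v_j,\ o_k,\ a_{ij},\ b_{ik},\ c_{jk},\ e_{ijk}) \\[4pt]
\sigma_x^2 = E\!\left[(X_{ijk}-m)^2\right]
  &= \sum_{r=1}^{7} E\!\left[u_r^2\right] \;+\; 2\sum_{r<s} E\!\left[u_r u_s\right] \\[4pt]
  &= \sigma_t^2 + \sigma_v^2 + \sigma_o^2 + \sigma_{tv}^2 + \sigma_{to}^2
     + \sigma_{vo}^2 + \sigma_e^2 \;+\; 2\sum_{r<s}\operatorname{Cov}(u_r,u_s)
\end{align*}
```

Because every component has mean zero, each $E[u_r^2]$ is a variance and each cross product is a covariance; the 21 cross-product terms are the "covariance terms" of the text.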

The assumptions involved in the use of Model 
II are that $t_i$, $v_j$, $o_k$, $a_{ij}$, $b_{ik}$, $c_{jk}$, and $e_{ijk}$ are 
all random variables, independently and normal- 
ly distributed with zero means and variances $\sigma_t^2$, 
$\sigma_v^2$, $\sigma_o^2$, $\sigma_{tv}^2$, $\sigma_{to}^2$, $\sigma_{vo}^2$, and $\sigma_e^2$, respectively. Un- 
der these assumptions the covariance terms all 
become zero, and the expectations of the mean 
squares in the analysis of variance of the scores 
in Table I are as shown in Table IV. 

Our principal purpose in planning the experi- 
ment in this case has been to decide whether or 
not there is evidence indicating real differences 
in the behaviors of teachers in a population of 
teachers that would be observed by "any" ob- 
server on "any" occasion, with respect to the 
dimension measured by the Climate Index. The 
null hypothesis implies that all the $t_i$ are equal, 
which is equivalent to saying that $\sigma_t^2$ is zero. We, 
therefore, may state this hypothesis as follows: 


$$H_0' : \sigma_t^2 = 0 .$$ 


In testing $H_0$, the analogous hypothesis in 
Model I, we used the ratio of the observed teach- 
er mean square to the observed residual mean 
square, because, under the null hypothesis, the 
two mean squares would have the same expected 
values. But the expected value of the teacher 
mean square in Model II is (from Table IV): 


$$\sigma_e^2 + 2\sigma_{tv}^2 + 4\sigma_{to}^2 + 8\sigma_t^2 .$$ 


Even if $\sigma_t^2 = 0$, this expected value will not equal 
$\sigma_e^2$ unless $\sigma_{tv}^2$ and $\sigma_{to}^2$ are also equal to zero. We, 
therefore, set up the auxiliary hypotheses 

$$H_1' : \sigma_{tv}^2 = 0$$ 
$$H_2' : \sigma_{to}^2 = 0$$ 


*Note that whereas in Model I, $T_i$ takes on only four values $T_1$, $T_2$, $T_3$, and $T_4$, $t_i$ in Model II can take 
on any of a very large number of values, one for each teacher in a population of teachers. Also, $T_i$ is 
the mean of the eight scores assigned by two observers to teacher $i$ on four visits, while $t_i$ is the 
mean of a very large number of scores that would be assigned by a great many observers on each of a 
great many similar visits. Similarly, $v_j$ and $o_k$ are defined in terms of populations of visits and ob- 
servers, and take on many different values. Because the values $t_1$, $t_2$, $t_3$, and $t_4$ are regarded as 
drawn at random from a population of values $t_i$, they are usually referred to as "random effects," in contra- 
distinction to $T_1$, $T_2$, $T_3$, and $T_4$, which are called "fixed effects." The interaction effects $a_{ij}$, $b_{ik}$, 
and $c_{jk}$ are also regarded as drawn at random from a population of such effects. 


The expected mean square for teacher-visit in- 
teraction (from Table IV) is 


$$\sigma_e^2 + 2\sigma_{tv}^2 .$$ 


If $H_1'$ is true, this becomes, simply, $\sigma_e^2$, and 
we see that the teacher-visit mean square esti- 
mates the same population variance as is esti- 
mated by the residual mean square. We may, 
therefore, test $H_1'$ by comparing the ratio of 
the two mean squares with the entry in the F- 
table for $n_1 = 9$ and $n_2 = 9$ degrees of freedom. 
The value of F so obtained indicates that $H_1'$ 
may be rejected. A similar test of $H_2'$ indi- 
cates that it may be accepted. In other words, 
$\sigma_{tv}^2$ is different from zero while $\sigma_{to}^2$ is not. 

The expected mean square for teachers may 
now be written as: 


$$\sigma_e^2 + 2\sigma_{tv}^2 + 8\sigma_t^2$$ 


since we have accepted the hypothesis that $\sigma_{to}^2$ 
is zero. If $H_0'$ is true, the teacher mean square 
has the same expected value as the teacher-vis- 
it interaction mean square. $H_0'$ may now be 
tested by comparing the ratio of the correspond- 
ing observed mean squares with the tabled val- 
ues of F for $n_1 = 3$ and $n_2 = 9$ degrees of free- 
dom. 

If $H_2'$ had not been accepted, that is, if $\sigma_{to}^2$ 
could not be set equal to zero, no exact test of $H_0'$ 
would be possible, although an approximate test 
is available (7). 

Tests of similar hypotheses concerning $\sigma_{vo}^2$ 
and $\sigma_o^2$ indicate that these two variances may be 
taken as equal to zero. The mean squares for 
observers, teacher-observer interaction, and 
visit-observer interaction, therefore, all have 
an expected value of $\sigma_e^2$. The corresponding 
sums of squares may then be pooled with the res- 
idual sum of squares to get an estimate of $\sigma_e^2$ 
based on 16 degrees of freedom. The resulting 
analysis of variance table is shown in Table V. 
As was the case with Model I, it appears that a 
two-way design with two observations per cell 
is an appropriate one for the analysis of these 
data with Model II assumptions. 

If we wish to test the hypothesis that $\sigma_v^2 = 0$, 
we must compare the observed mean square for 
visits with the teacher-visit interaction mean 
square rather than with the residual mean square 
as we did in Model I. 

Note that both $H_0'$ and the hypothesis that $\sigma_v^2 = 0$ 
are accepted in the present analysis (F = 2.73 and 
F < 1.00, respectively), although in Model I the 
hypothesis $H_0$, analogous to $H_0'$, was rejected 
while that analogous to the visit hypothesis 
remained in doubt. Even though these ob- 
servers detected differences among the behav- 
iors of these teachers on these occasions, the 
differences were not great enough to justify the 
conclusion that other teachers like these would 
also be found to differ if other observers had ob- 
served them on other occasions. 
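
The contrast between the two models comes down to which mean square serves as the denominator of F. A minimal sketch using the mean squares reported in the tables follows; the variable names are illustrative, and 3.86 is the 5 percent point for 3 and 9 degrees of freedom quoted earlier in the text.

```python
MS_TEACHER = 348.792
MS_VISIT = 22.042
MS_TEACHER_VISIT = 127.819   # error term for main effects under Model II
MS_RESIDUAL = 4.597          # error term under Model I (before pooling)
F_05_3_9 = 3.86              # 5 percent point for 3 and 9 d.f., from the text

# Model I: fixed effects, tested against the residual mean square.
f_teacher_model_1 = MS_TEACHER / MS_RESIDUAL
f_visit_model_1 = MS_VISIT / MS_RESIDUAL

# Model II: random effects, tested against the teacher-visit interaction.
f_teacher_model_2 = MS_TEACHER / MS_TEACHER_VISIT
f_visit_model_2 = MS_VISIT / MS_TEACHER_VISIT

for label, f in [
    ("teacher, Model I", f_teacher_model_1),
    ("visit, Model I", f_visit_model_1),
    ("teacher, Model II", f_teacher_model_2),
    ("visit, Model II", f_visit_model_2),
]:
    print(f"{label}: F = {f:.2f}, exceeds 5 percent point: {f > F_05_3_9}")
```

The same observed teacher mean square is decisive under Model I and inconclusive under Model II because the interaction mean square is almost 28 times the residual mean square.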


A Mixed Model Analysis 





As a third example, suppose we are interest- 
ed in whether the two observers employed in the 
study which yielded the data in Table I can de- 
tect significant differences in the verbal behav- 
iors of teachers in a population of teachers (from 
which the four teachers who were actually ob- 
served may be regarded as a random sample) by 
observing them on any occasion (regarded as 
chosen at random from a population of occasions). 
We are not concerned about the ability of other 
observers to discriminate teachers in this anal- 
ysis. 

The model appropriate for investigating this 
problem is called a "Mixed Model" and may be 
written as follows: 


$$X_{ijk} = m' + t_i' + v_j' + O_k' + a_{ij}' + b_{ik}' + c_{jk}' + e_{ijk}'$$ 


In this model, $m'$ is the general mean; $t_i'$ is de- 
fined as the mean of all the scores (in deviation 
form) that a particular teacher $i$ would get from 
these two observers on all occasions on which 
she might be visited; $v_j'$ as the mean of all of the 
scores (in deviation form) that all the teachers 
would get from the two observers on a particu- 
lar visit $j$; and $O_k'$ as the mean (in deviation 
form) of all the scores assigned to all the teach- 
ers on all visits made by observer $k$. $O_k'$ is a 
"fixed effect" which takes on only these two val- 
ues; $t_i'$ and $v_j'$ are "random effects" which 
could take any of a large number of values. The 
interaction terms $a_{ij}'$, $b_{ik}'$, and $c_{jk}'$, as in pre- 
vious models, represent effects peculiar to teach- 
er $i$ on visit $j$, to teacher $i$ when observed by 
observer $k$, and to visit $j$ by observer $k$, re- 
spectively. Finally, $e_{ijk}'$ is the residual por- 
tion of the individual observation $X_{ijk}$ not account- 
ed for by any of the other seven hypothesized 
components. 

We must assume that $t_i'$, $v_j'$, $a_{ij}'$, $b_{ik}'$, $c_{jk}'$, 
and $e_{ijk}'$ are random variables independently 
and normally distributed with zero means and 
variances $\sigma'^{2}_{t}$, $\sigma'^{2}_{v}$, $\sigma'^{2}_{tv}$, $\sigma'^{2}_{to}$, $\sigma'^{2}_{vo}$, and $\sigma'^{2}$, re- 
spectively. In addition to these six variances 
we have postulated two other parameters, $O_1'$ 
and $O_2'$; our hypotheses will be phrased in 
terms of these eight parameters. Table VI 
shows the expected values of the mean squares 
for the data in Table I under the Mixed Model 
assumptions. The obtained mean squares are 

TABLE V 


CONDENSED MODEL II ANALYSIS OF VARIANCE 


Source of             Degrees of    Expected                                            Observed 
Variation             Freedom       Mean Square                                         Mean Square 

Teacher Effect            3         $\sigma_e^2 + 2\sigma_{tv}^2 + 8\sigma_t^2$         348.792 
Visit Effect              3         $\sigma_e^2 + 2\sigma_{tv}^2 + 8\sigma_v^2$          22.042 
Teacher-Visit 
  Interaction             9         $\sigma_e^2 + 2\sigma_{tv}^2$                       127.819 
Residual                 16         $\sigma_e^2$                                          4.687 





TABLE VI 


ANALYSIS OF VARIANCE OF CLIMATE INDEX SCORES 
UNDER MIXED MODEL ASSUMPTIONS 


Source of             Degrees of    Expected                                                                    Observed 
Variation             Freedom       Mean Square                                                                 Mean Square 

Teacher Effect            3         $\sigma'^{2} + 2\sigma'^{2}_{tv} + 8\sigma'^{2}_{t}$                        348.792 
Visit Effect              3         $\sigma'^{2} + 2\sigma'^{2}_{tv} + 8\sigma'^{2}_{v}$                         22.042 
Observer Effect           1         $\sigma'^{2} + 4\sigma'^{2}_{vo} + 4\sigma'^{2}_{to} + 16\sum O_k'^{2}$       3.125 
Teacher-Visit 
  Interaction             9         $\sigma'^{2} + 2\sigma'^{2}_{tv}$                                           127.819 
Teacher-Observer 
  Interaction             3         $\sigma'^{2} + 4\sigma'^{2}_{to}$                                             4.792 
Visit-Observer 
  Interaction             3         $\sigma'^{2} + 4\sigma'^{2}_{vo}$                                             5.375 
Residual                  9         $\sigma'^{2}$                                                                 4.597 


also given. 

The hypotheses $H_0''$: $\sigma'^{2}_{tv} = 0$, $H_1''$: $\sigma'^{2}_{to} = 
0$, and $H_2''$: $\sigma'^{2}_{vo} = 0$, proposing that the respec- 
tive interaction effects are zero, are tested ex- 
actly as in Model I, in accordance with the ex- 
pected mean squares. $H_0''$ is rejected but $H_1''$ 
and $H_2''$ are accepted for the data in Table I. 

Under the assumptions of the Mixed Model, the expected mean square
for observers is:

$$\sigma'^2 + 4\sigma'^2_{vo} + 4\sigma'^2_{to} + 16\sum_k O_k'^2 .$$

The corresponding observed mean square is 3.125. Since this is
actually less than 4.597, our estimate of $\sigma'^2$ alone, it is
apparent that not only $\sigma'^2_{to}$ and $\sigma'^2_{vo}$, but
$\sum_k O_k'^2$ as well, must be regarded as zero. If
$O_1'^2 + O_2'^2 = 0$, then both $O_1'$ and $O_2'$ must be zero. We
can then drop from our model the three terms $b_{ik}'$, $c_{jk}'$,
and $O_k'$, since all values of them are zero. The revised model is:

$$X_{ijk} = m' + t_i' + v_j' + a_{ij}' + e_{ijk}'$$

and the expectations together with the obtained mean squares (after
pooling) are shown in Table V. The conclusions will then be the same
as those reached under Model II assumptions.
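
The pooling step can be checked with a few lines of arithmetic. The sketch below is illustrative only: the degrees of freedom are those implied by a layout of four teachers, four visits and two observers (an assumption drawn from the text), and the helper name `pooled_mean_square` is hypothetical.

```python
# Observed mean squares from Table VI, paired with degrees of freedom
# ASSUMED from a 4-teacher x 4-visit x 2-observer layout.
terms_to_pool = {
    "observer": (1, 3.125),
    "teacher_observer": (3, 4.792),
    "visit_observer": (3, 5.375),
    "residual": (9, 4.597),
}


def pooled_mean_square(terms):
    """Pool mean squares, weighting each by its degrees of freedom."""
    total_df = sum(df for df, _ in terms.values())
    total_ss = sum(df * ms for df, ms in terms.values())
    return total_df, total_ss / total_df


pooled_df, pooled_ms = pooled_mean_square(terms_to_pool)

# F ratios suggested by the expected mean squares of the revised model:
# the interaction is compared with the pooled residual, and each main
# effect with the teacher-visit interaction.
ms_teacher, ms_visit, ms_interaction = 348.792, 22.042, 127.819
f_interaction = ms_interaction / pooled_ms
f_teacher = ms_teacher / ms_interaction
f_visit = ms_visit / ms_interaction

print(pooled_df, round(pooled_ms, 3))
print(round(f_interaction, 2), round(f_teacher, 2), round(f_visit, 2))
```

Under these assumed degrees of freedom the pooled value agrees with the residual mean square reported in Table V.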


Discussion 


The selection of the mathematical model to 
be used in the analysis of an experiment should 
be guided by a consideration of the use to which 
the results are to be put by the experimenter 
himself, and, if publication is contemplated, by 
others who read his report. It is convenient to 
think about the applications of experimental re- 
sults in terms of the population of experiments 
whose results are to be predicted on the basis 
of a given investigation. In the case of the try- 
out of Withall’s technique which has been used 
for an example in this paper, we have tried to 
show how considerations about the population to 
which the findings are to be applied influence 
the choice of model. Let us briefly consider 
the problem anew, in terms of populations of 
experiments. 

Suppose we are interested in answering the following question:
Could the differences in average Climate Index scores that these
two observers noted when they visited these teachers have arisen
solely as the result of ‘‘accidental’’ conditions prevailing during
the visits, or did the behaviors of the teachers really differ in
these respects? This question can be restated as follows: If the
experiment had been performed just as it was, with the same
observers and the same teachers on the same occasions, but if things
had been different (if it had rained or if there had been a fire
drill during the preceding period), would the teachers’ Climate
Index scores have had the same rank order? We now see that we are
interested in a population of experiments performed with the same
teachers and observers on the same occasions; hence, our main
effects $T_1$, $T_2$, etc., $O_1$ and $O_2$, and $V_1$, $V_2$, etc.,
as well as our interaction effects $A_{11}$, $A_{12}$, etc.,
$B_{11}$, $B_{12}$, etc., and $C_{11}$, $C_{12}$, etc., are all
fixed and Model I is clearly indicated.
The limited applicability of Model I is appar- 
ent from this example. We can make no infer- 
ences about Climate Index scores of other teach- 
ers, about the skill of other observers, nor even 
about the behaviors of these four teachers on 
other occasions. Unless the observers, teach- 
ers and occasions, all three, are of particular 
interest for some reason, the experiment above 
is inappropriate here. 

We are almost certainly interested in the be- 
haviors of the teachers at other times than when 
they were visited; so we should regard the four 
visit effects associated with the four occasions 
on which the visits were made as a sample from
a population of visit effects associated with many 
occasions, in any of which the teachers’ behav- 
iors are important. Therefore, let us say that 
the visit effect is random—in other words, let 
us set up a population of visit effects, of which 
these four are a random sample. 

Similarly, we are interested in other teachers 
than these four, so we set up a population of 
teachers. We then have $t_1'$, $t_2'$, $t_3'$, and $t_4'$
as random samples from a $t'$-population, $v_1'$,
$v_2'$, $v_3'$, and $v_4'$ as random samples from a
$v'$-population, and similarly with the interactions
between them.

Suppose, however, we are trying out Withall’s technique as one that
may be adopted in an extensive study that is being planned, and
that if the technique ‘‘works’’ it will be used in the study by the
same observers who conducted the tryout. We may then argue that
there is no need to consider a population of observers, so we
postulate but two observer effects, $O_1'$ and $O_2'$, which are
regarded as fixed. We now see that we wish to generalize to a
population of experiments using these two observers, but using
various teachers who will be visited at various times. This leads us
to the mixed model described above.

However, if we contemplate publishing the results of this
experiment, we should consider that a reader may read our results
with a view to deciding whether or not he should adopt the technique
for use in a study he is planning. Now he is thinking of a
population of experiments which, in addition to those we are
thinking of, also includes experiments with other observers
‘‘like’’ these two. To justify generalization to such a population
we must analyze our data under Model II assumptions. If we publish
results of a Model I or Mixed Model analysis, we should realize and
point out in the discussion of our results that our findings have
limited application.

The choice of a model depends upon the hy- 
potheses to be tested, which in turn specify the 
manner in which the data are collected. It fol- 
lows, then, that the selection of a model must 
be made before the data are collected if the conclusions
are to have any validity. Whether a particular variable
should be regarded as a fixed or random effect depends
on the logic of the particular situation, as well as upon
the purpose of the experiment. In many studies that
involve individuals (teachers, pupils, etc.) as a main
effect, the investigator will want to designate this main effect as
random. On the other hand, experiments involv- 
ing teaching methods, amounts of practice, etc., 
should probably regard the methods or amounts 
of practice as fixed. It is difficult to see, for 
instance, how two or three methods of teaching 
algebra could logically be regarded as a random 
sample from a population of methods for teach- 
ing algebra. Other effects, such as schools, 
observers, and occasions may, depending on 
whether the principle of randomization has been 
employed, be considered random in some stud- 
ies and fixed in others. 

It seems safe to say that more attention should be given to mixed
models, which are likely to be applicable more often than either
Model I or Model II.

If a mixed model or Model II is decided upon 
as appropriate for a particular experiment, the 
investigator should define the population from 
which he will draw his sample. It is clear that 
the only really satisfactory way of describing 
a sample is to tell how it was obtained. A de- 
scription of the sample used in an experiment is 
absolutely necessary to the interpretation of the 
results, for this information provides the only 
basis a reader can have for applying the results 
of an experiment to his own situation. 

The reader anxious to pursue the topic of analysis-of-variance
models further should consult the classic paper of Eisenhart (2),
to which the present discussion is deeply indebted, and the
discussion in McNemar’s revised textbook (5:281-342). Schultz (8)
has recently published a paper giving simple procedures for
determining expectations of mean squares under Model II and mixed
model assumptions. Mood (7:342-349) treats the topic from a more
mathematical point of view. Fisher's original paper (3) remains as
readable as anything subsequently published.


THE MULTIPLE GROUP METHOD OF FACTOR
ANALYSIS AND ROTATION TO A SIMPLE
STRUCTURE HYPOTHESIS


PAUL HORST and K. W. SCHAIE 
University of Washington 


THE MULTIPLE group method of factor an- 
alysis developed by Thurstone (9) is one of the 
most rapid computational techniques available. 
It is much more rapid than the centroid method 
also developed by Thurstone. The success of 
the multiple group method in reproducing the 
correlation matrix depends somewhat on the se- 
lection of the variables to be included in each 
group. 

In describing the multiple group method, 
Fruchter (3) suggests that a preliminary cluster 
analysis of the correlation matrix may provide 
the basis for the grouping of variables. Guttman (4),
on the other hand, points out that a factor analysis
study, like any other scientific investigation, should
begin with certain initial hypotheses and proceed to
test these hypotheses.
The implication seems to be that in setting up a 
factor study one should have a hypothesis as to 
the primary variables in the particular domain 
under investigation and should include measures 
of these variables in one’s test battery. This 
point of view seems to be in line with that of 
Thurstone (10), although his rotational proced- 
ures have not in general presupposed any knowl- 
edge of the primary factors in the system to be 
investigated. 

In any case, suppose one has a hypothesis as to the variables in
the correlation matrix which have appreciable loadings on each of
the primary factors. One can then construct a hypothesis matrix
consisting of 0's and 1's, such that if a test is assumed to have
no loading for a given factor, a zero will appear in the
corresponding position, and if it is assumed to have a loading, a 1
appears. One may then use this hypothesis matrix as a basis for
grouping the variables to be factored by the multiple group method.

In a recent article Horst (5) has shown how some of the
computations outlined by Thurstone for the multiple group method
can be simplified. This simplified procedure gives the same results
as Thurstone’s method and likewise provides no unique solution, in
the sense that the results depend on the particular order in which
the groups of variables are arranged. If there are n groups, then
the number of different solutions is n!. A method for obtaining a
unique solution which does not depend on the order of the groups
has been outlined by Adcock (1). However, this method does not
allow for correlations among the primaries since it involves an
orthogonal solution.

A method has been developed by Tucker (11) for rotating any
arbitrary factor loading matrix to a simple structure hypothesis. A
simplified approximate solution has been given by Horst (6) for the
case of a centroid factor loading matrix. If a simple structure
hypothesis is available, the multiple group method and Tucker's
rotational method may be combined to effect a great economy of
labor in computing the simple structure factor loading matrix. This
solution is unique in that it does not depend upon the order of
groups. The procedure for this method will be described with the
aid of a numerical example. The variables used are from a study by
Schaie (7) which was concerned with investigating the
dimensionality of rigidity. The details of this study will be
reported in a separate paper by Schaie (8).

Table I gives the intercorrelations for ten variables for a sample
of 216 cases. The highest correlation in each column has been used
as an estimate of the test communality. A preliminary inspection of
this table, together with certain other considerations, suggested
that the variables be grouped as follows: (1, 2, 3), (4, 5, 6), and
(7, 8, 9, 10). As pointed out by Guttman (4), it is entirely
possible for a variable to be included in more than one group. For
example, variable 7, because of its relatively high correlation of
.404 with variable 3, might have been included in both the first
and third groups. Table II gives the hypothesis matrix according to
which the variables were grouped. Table III shows the factor
loadings obtained by the computational procedure outlined in (5).
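
As an illustration, the grouping just described can be turned into a zero-one hypothesis matrix mechanically. This is a minimal sketch, assuming each variable belongs to exactly the groups listed above; the function name `hypothesis_matrix` is hypothetical, and the output is only presumed to correspond to Table II.

```python
# Groups of variables (1-based numbering) as stated in the text.
groups = [(1, 2, 3), (4, 5, 6), (7, 8, 9, 10)]
n_tests = 10


def hypothesis_matrix(groups, n_tests):
    """Return an n_tests x len(groups) matrix of 0s and 1s.

    Entry [t][f] is 1 when test t + 1 is assumed to load on factor f + 1.
    """
    matrix = [[0] * len(groups) for _ in range(n_tests)]
    for factor, members in enumerate(groups):
        for test in members:
            matrix[test - 1][factor] = 1
    return matrix


F = hypothesis_matrix(groups, n_tests)
for row in F:
    print(row)
```

A variable assigned to two groups, as suggested for variable 7, would simply appear in both tuples and receive two 1's in its row.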


TABLE I

CORRELATION MATRIX
(N = 216)


Computational Procedure:

The row and column sums are computed for the factor loading matrix
(Table III). It is important that the row and column sums add up to
the same total, in our example 7.79. The column and row sums are to
be used for checking purposes and it is, therefore, important that
they be completely accurate. The successive steps in the
computations required are as follows:

1. Calculate the matrix of sums of squares 
and sums of cross products by columns for the 
factor loading matrix (Table III). Enter the re- 
sults in the upper left-hand quadrant of Table 
IV. It is not necessary to write the entries be- 
low the diagonal since these are implied. In get- 
ting sums of cross products include also the 
column of row sums with the preceding columns 
of the factor loading matrix and enter the results 
in row C, columns 1A through 3A of Table IV.
Add the entries in each column above this row, 
including the implied infra-diagonal entries, to 
see that the sum is the same as the correspond- 
ing entry in the C row. For example, in the 
second column of Table IV we have .520 + .792 
+ .263 = 1.575. These last operations check 
the accuracy of the sums of squares and sums 
of cross-products. 

2. The elements in the upper right quadrant of Table IV are
calculated from Tables II and III. The entry in the i'th row and
j'th column is the sum of cross-products of entries from the i'th
column of Table III and the j'th column of Table II. Since the
entries in Table II are all either zero or one, these sums of
cross-products are simply sums of selected elements from columns of
Table III. For example, the entry 2.160 in row 1A, column 1B of
Table IV is simply the sum of the first three entries in column 1
of Table III; namely, .72 + .76 + .68 = 2.160. The entry in row 2A,
column 3B is likewise obtained from the second column of Table III
and the third column of Table II. It is .15 + .07 + .18 + .10 =
.500. The elements in the C row, columns 1B to 3B, are similarly
obtained from the corresponding columns of Table II and the column
of row sums in Table III. For example, the entry in row C, column
2B, is .39 + .89 + .89 = 2.170. To check the entries in the upper
right quadrant of Table IV, see that each column sum is equal to
the corresponding entry in the C row. For example, in column 1B we
have 2.160 + .010 + .010 = 2.180.

3. Calculate the row sums for each row in 
the upper half of Table IV and enter in column 
C, rows 1A through 3A. Include the implied
entries to the left of the diagonal in the upper 
left quadrant. Get the total of these row sums 
and enter in the space immediately below. In 
the example this total is 14.456. Now total the 
entries in the C row up to, but not including, 
the C column and see that this total checks with 
the one for Column C. 
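
Steps 1 through 3 amount to forming two cross-product matrices together with check sums. The sketch below illustrates the idea on a small made-up loading matrix; the numbers are hypothetical and are not those of Table III, and the function name `cross_products` is likewise an assumption.

```python
def cross_products(left, right):
    """Return left' * right for matrices stored as lists of rows."""
    return [[sum(l[i] * r[j] for l, r in zip(left, right))
             for j in range(len(right[0]))]
            for i in range(len(left[0]))]


# HYPOTHETICAL loadings (4 tests, 2 factors); not the values of Table III.
a = [[0.70, 0.10], [0.60, -0.20], [0.10, 0.80], [-0.10, 0.50]]
# HYPOTHETICAL hypothesis matrix for the same four tests.
F = [[1, 0], [1, 0], [0, 1], [0, 1]]

row_sums = [[sum(row)] for row in a]       # column of row sums of a

ata = cross_products(a, a)                 # upper left quadrant
atf = cross_products(a, F)                 # upper right quadrant
c_left = cross_products(row_sums, a)[0]    # C row, A columns
c_right = cross_products(row_sums, F)[0]   # C row, B columns

# Check: each column of a quadrant must add up to its C-row entry.
for j in range(2):
    assert abs(sum(ata[i][j] for i in range(2)) - c_left[j]) < 1e-12
    assert abs(sum(atf[i][j] for i in range(2)) - c_right[j]) < 1e-12

print(ata, atf, c_left, c_right)
```

The check works because summing a column of a'a (or a'F) over the factors is the same as taking cross-products with the column of row sums.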
The next set of computations consists in pre-multiplying the matrix
in the upper right quadrant of Table IV by the inverse of the
matrix in the upper left quadrant. There are numerous ways in which
this can be done, a number of which are described by Dwyer (2).
However, we shall outline a method which calls for patterns of
responses which in our opinion are less complex than other methods
we have seen. The procedure to be described carries a complete set
of checks. It consists of two parts: the forward solution and the
back solution. The forward solution follows the same routine as
previously described by Horst (5), but will be repeated here for
the sake of completeness. An additional worksheet ruled up in the
same manner as Table IV is required for the forward solution. This
new worksheet is shown in Table V. The steps for the forward
solution are as follows:

4. Copy row 1A of Table IV into row 1B of 
Table IV. 

5. Get the reciprocal of the entry in row 1B, column 1A of Table IV
and enter in row 1B, column R at the extreme left of Table IV. This
is 1/2.148 = .46555.

6. Enter 1.000 in each diagonal position in the upper left quadrant
of Table V.

7. Fold Table IV between row 1B and 2B and lay on Table V so that
row 1B of Table IV lies immediately above row 1B of Table V.

8. Copy into row 1B of Table V the signs opposite to those in the
row above, that is, row 1B of Table IV.

9. Multiply each entry in row 1B of Table IV by the corresponding
entry in column R to the extreme left and enter these products in
row 1B of Table V. The first entry will be 1.000 with the negative
sign. This is a check on the reciprocal. The next entry in the
example is .46555 × .520 = .242 with the negative sign.

10. Sum the entries in row 1B of Table V, from column 1A through
3B, and enter the result in column S of the same row. In the
example, this is 3.392 with negative sign. This sum should be the
same as the number immediately to its left within the limits of
rounding error.

11. Fold Table V between columns 2A and 3A 
and lay this folded table on top of Table IV so 
that column 2A of Table V lies immediately to 
the left of column 2A of Table IV. 

12. Get the sum of the products of corresponding entries in columns
2A of the two tables and enter the result in column 2A, row 2B of
Table IV. In the example, this is 1.000 × .792 + (-.242) × (.520) =
.666.

13. Move column 2A of Table V one column to the right so that it
lies immediately to the left of column 3A of Table IV. Sum the
products of corresponding entries in the two columns and enter the
result in column 3A, row 2B of Table IV. In the example this entry
is .119.


TABLE IV

          2A       3A       1B       2B

1A      .520     .597    2.160     .750
2A      .792     .263     .010    1.410
3A               .966     .010     .010

C      1.575    1.826    2.180    2.170

1B      .520     .597    2.160     .750
2B      .666     .119    -.513    1.228
3B               .779    -.499    -.418


TABLE V

          1A       2A       3A       1B       2B       3B        S

1B    -1.000    -.242    -.278   -1.006    -.349    -.517   -3.392
2B             -1.000    -.179     .770   -1.844    -.347   -2.601
3B                      -1.000     .641     .537   -1.900   -1.721


TABLE VI


TABLE VII

TRANSFORMATION MATRIX

      D^-1        1        2        3        C

1    .61501     .826    -.403    -.394    1.000
2    .49672     .014     .964    -.267    1.000
3    .52632    -.007     .004    1.000    1.000


TABLE VIII

SIMPLE STRUCTURE MATRIX


TABLE IX

CORRELATIONS AMONG
PRIMARIES

14. Calculate the remaining entries through 
column C in row 2B of Table IV in the same 
manner as in steps 12 and 13, placing column 
2A of Table V immediately to the left of the ap- 
propriate column in Table IV. 

15. Sum the entries in row 2B of Table IV through column 3B and
enter the sum in column S of this row. This should give the same
number as the one immediately to its left within limits of rounding
error. In the example, the two numbers are 1.732 and 1.731.

16. Calculate the reciprocal of the first entry in row 2B of Table
IV and enter the result in column R to the far left in the same
row. This is 1/.666 = 1.5015.

17. Fold Table IV between row 2B and row 3B and lay on Table V so
that row 2B of Table IV lies immediately above row 2B of Table V.

18. Repeat operations 8 through 10 to get and check row 2B of
Table V.

19. Fold Table V between columns 3A and 
1B and, beginning with column 3A of Table IV 
repeat operations 12 through 15 to get and check 
row 3B of Table IV. 

20. Repeat operations 8 through 10 to get and 
check row 3B of Table V. 

If there had been more than three factors the 
operations given in steps 8 through 15 would 
have been repeated until all the B rows had been 
filled. 

The forward solution is now complete. Table VI is next drawn up.
Enter -1 in each diagonal position of the right hand section and
zeros in the C column. The back solution proceeds as follows:

21. Copy row 3B with signs reversed from columns 1B through 3B of
Table V into column 3A of Table VI. In the example, these entries
are respectively, from top to bottom, -.641, -.537, 1.900.

22. Fold Table VI between row 1 and the row of column designations
so that the latter is folded under and place Table VI so that row 1
lies immediately below row 2B of Table V. Be sure the vertical
rulings of Tables V and VI are in alignment.

23. Multiply row 2B of Table V by row 1 of Table VI and enter the
result in row 1, column 2A of Table VI. This is (-.641) × (-.179) +
(-1.000) × (.770) = -.655.

24. Slide Table VI up one row so that row 1 lies immediately below
row 1B of Table V.

25. Multiply row 1B of Table V by row 1 of Table VI and enter the
result in row 1, column 1A of Table VI. This is (-.655) × (-.242) +
(-.641) × (-.278) + (-1.000) × (-1.006) = 1.343.

26. Place row 1 of Table VI immediately be- 
low the C row of Table IV. 

27. Multiply row C of Table IV by row 1 of Table VI and enter the
result in row 1, column S at the extreme right of Table VI. This
entry should be zero within limits of decimal error. It constitutes
a check on row 1 of Table VI. In the example this entry is .003.

28. Next fold Table VI between rows 1 and 2 
so that row 2 is the first row showing at the top 
of the sheet. Place Table VI so that row 2 lies 
immediately below row 2B of Table V. 

29. Calculate and check the second and first 
entries in row 2 of Table VI as in steps 23 
through 27. 

30. Calculate and check row 3 of Table VI as in steps 23 through
27. If more than three factors had been hypothesized we would
continue the operations given in steps 23 through 27 until all A
columns of Table VI had been filled.

This completes the back solution and gives 
the transformation matrix. This matrix, how- 
ever, must be normalized before it can be used 
to rotate the factor loading matrix to simple 
structure. The necessary computations for this 
next step are shown in Table VII and proceed as 
follows: 

31. Calculate the sum of squares of the entries in the first row of
the left hand section of Table VI and enter the result in row 1,
column $D^2$ of Table VII. This is $(1.343)^2 + (-.655)^2 +
(-.641)^2 = 2.644$.

32. The remaining two entries of column $D^2$ of Table VII are
computed as in step 31 from the corresponding rows of Table VI.

33. Column $D$ of Table VII contains the square roots of the
corresponding entries in column $D^2$. For example, the first entry
in column $D$ is $\sqrt{2.644} = 1.6260$.

34. Column $D^{-1}$ consists of the reciprocals of the
corresponding elements of column $D$. For example, the first entry
in column $D^{-1}$ is $1/1.6260 = .61501$.

35. Multiply each entry in column $D^{-1}$ by its corresponding
entry in column $D^2$. This product should be the same as the
corresponding entry in column $D$ within rounding error and
constitutes a check on both columns $D$ and $D^{-1}$.

36. Multiply each entry in row 1 of the left hand section of Table
VI by the first entry in column $D^{-1}$ of Table VII and enter the
results in row 1, columns 1 through 3, of Table VII. For example,
the entry in row 1, column 1 of Table VII is 1.343 × .61501 = .826.

37. Calculate the entries in rows 2 and 3, columns 1 through 3 of
Table VII as in step 36, by using the corresponding rows from Table
VI and the corresponding entries from column $D^{-1}$ of Table VII.

38. Calculate the sum of squares of the entries in row 1, columns 1
through 3, of Table VII and enter the result in row 1, column C of
Table VII. This is $(.826)^2 + (-.403)^2 + (-.394)^2 = 1.000$. This
number should be unity within limits of decimal error and
constitutes a check on steps 31 through 37.
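
The forward solution, back solution, and normalization can be sketched in a few lines. This is not the folded-worksheet routine itself but ordinary elimination, which should agree with it apart from intermediate rounding. The entries 2.148, 1.110 and 1.830 are assumptions: they are back-calculated from the reciprocal .46555, from row 1B of Table V, and from the total 14.456 quoted in step 3. The function name `solve` is hypothetical.

```python
import math

# Upper left quadrant of Table IV (a'a) and upper right quadrant (a'F).
# ASSUMPTION: 2.148, 1.110 and 1.830 are back-calculated, not quoted.
ata = [[2.148, 0.520, 0.597],
       [0.520, 0.792, 0.263],
       [0.597, 0.263, 0.966]]
atf = [[2.160, 0.750, 1.110],
       [0.010, 1.410, 0.500],
       [0.010, 0.010, 1.830]]


def solve(lhs, rhs):
    """Solve lhs * M = rhs by a forward and a back solution."""
    n = len(lhs)
    n_rhs = len(rhs[0])
    # Augmented copy, so the inputs are left untouched.
    aug = [list(lhs[i]) + list(rhs[i]) for i in range(n)]
    # Forward solution: clear the entries below each diagonal element.
    for p in range(n):
        for r in range(p + 1, n):
            factor = aug[r][p] / aug[p][p]
            aug[r] = [x - factor * y for x, y in zip(aug[r], aug[p])]
    # Back solution: work upward from the last row.
    m = [[0.0] * n_rhs for _ in range(n)]
    for col in range(n_rhs):
        for r in reversed(range(n)):
            known = sum(aug[r][k] * m[k][col] for k in range(r + 1, n))
            m[r][col] = (aug[r][n + col] - known) / aug[r][r]
    return m


M = solve(ata, atf)

# Normalize each column of M to unit length. Each normalized column is
# stored as a row, which is the layout of columns 1-3 of Table VII.
H_transposed = []
for col in range(3):
    column = [M[r][col] for r in range(3)]
    length = math.sqrt(sum(x * x for x in column))
    H_transposed.append([x / length for x in column])

for row in H_transposed:
    print([round(x, 3) for x in row])
```

Because the worksheet rounds to three decimals at every stage, the printed values may differ from Table VII in the last decimal place.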

Columns 1 through 3 of Table VII constitute the transpose of the
matrix which transforms the arbitrary factor matrix of Table III
into the oblique simple structure matrix of Table VIII. To get
Table VIII proceed as follows:

39. Calculate the sum of products of corresponding entries from the
first row of Table III and row 1, columns 1 through 3 of Table VII,
and enter the result in the first row and column of Table VIII.
This is .72 × .826 + (-.06) × (-.403) + (-.11) × (-.394) = .66.

40. Calculate the remaining entries of column 1 in Table VIII as in
step 39, by using row 1 of Table VII and the corresponding rows of
Table III. Include also the Σ row of Table III.

A convenient aid in making these computations is to fold Table VII
immediately above row 1 and place this row immediately below the
row of Table III for which the corresponding element of column 1 in
Table VIII is to be computed.

41. Sum the entries in column 1 of Table VIII down to but not
including the Σ row and enter this sum in the C row of column 1.
This number should be the same within limits of rounding error as
the one immediately above it. In the example both of these numbers
are 1.82.

42. Calculate and check the entries in the remaining two columns of
Table VIII by multiplying rows 2 and 3 of Table VII, respectively,
by the appropriate rows of Table III.
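
Steps 39 through 42 are a row-by-row matrix product. A minimal sketch for the single row of Table III quoted in step 39 follows; the third entry of row 1 of Table VII is taken as -.394, the value used in steps 38 and 39, and the function name `rotate_row` is hypothetical.

```python
# Columns 1 through 3 of Table VII, one list per row of the table.
H_transposed = [[0.826, -0.403, -0.394],
                [0.014, 0.964, -0.267],
                [-0.007, 0.004, 1.000]]

# First row of Table III, as quoted in step 39.
a_row = [0.72, -0.06, -0.11]


def rotate_row(loadings, h_transposed):
    """Return one row of b = aH: a sum of products per row of H'."""
    return [sum(x * h for x, h in zip(loadings, h_row))
            for h_row in h_transposed]


b_row = rotate_row(a_row, H_transposed)
print([round(x, 2) for x in b_row])
```

The first element reproduces the .66 of step 39; the remaining two elements are the entries the same test would receive in columns 2 and 3 of Table VIII under this sketch.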

If it is desired to calculate the correlations
among the primary vectors, these may be computed
from the transformation matrix given in
Table VII, according to methods outlined by 
Thurstone (10). The correlations among the 
primaries in our example are given in Table 
IX. 


Mathematical Appendix: 

The mathematical basis of the method outlined above may be
summarized as follows. We let

    $a$ be an ($n \times s$) arbitrary factor loading matrix of $n$
    tests and $s$ factors,

    $F$ be a simple structure hypothesis matrix of the same order as
    $a$ (the $ij$'th element of $F$ is zero if a near-zero factor
    loading is assumed for that position, and unity if an
    appreciable loading is assumed),

    $b$ be the simple structure factor matrix,

    $H$ be the matrix which transforms $a$ to $b$,

    $\sigma$ be a diagonal matrix.

Horst (6) has shown that the least square solution for the
transformation matrix is

$$H = M\sigma \qquad (1)$$

where

$$M = (a'a)^{-1} a'F \qquad (2)$$

and, if $D_{M'M}$ is a diagonal matrix of the diagonal elements of
$M'M$,

$$\sigma = D_{M'M}^{-1/2}. \qquad (3)$$

Then

$$b = aH. \qquad (4)$$

In the example given in this paper, Table II is the matrix $F$ and
Table III is $a$. The upper left hand quadrant of Table IV contains
$a'a$ and the upper right quadrant $a'F$. The left section of Table
VI is the transpose of $(a'a)^{-1} a'F$, or $M'$. Column $D^{-1}$ of
Table VII gives the elements of the matrix $\sigma$. The three
columns of Table VII to the left of column C give the transpose of
$M\sigma$, or $H'$. Table VIII gives the simple structure matrix
aH or b.


AN ANALYSIS OF VARIANCE METHOD FOR
DETERMINING THE EXTERNAL AND
INTERNAL CONSISTENCY OF AN
EXAMINATION


WILLIAM J. MOONAN*
San Diego, California


Introduction 


IN A PREVIOUS paper (1), a statistical model of a response to an
examination item, which arose as a result of the application of a
simple experimental design, was specified and an analysis of
variance was derived for testing and estimating certain parameters
in the model. Additionally, a procedure was given for estimating
the intra-class correlation, called the coefficient of internal
consistency, of the responses to the items and the index of
internal consistency of the examination scores.

The purpose of this paper is to demonstrate how to test for the
existence, and make an estimation, of the external consistency of
the responses; that is, a measure of the intra-class correlation of
the responses to the items in repeated administrations of the
examination to the same subjects. To accomplish this, another
observation model will be specified and a new analysis of variance
will be made, and tests of hypotheses for certain parameters in the
model will be obtained. A need for a measure of the external
consistency comes about as a result of our desire to obtain
information relative to the stability of the responses or scores on
the examination over some set of administrations or time interval.


The Model and Its Analysis 





We will assume that the value of a response to an item is composed
of additive components which involve the following effects: $\mu$
for a general level effect; $\iota(i)$ for the effect of the $i$th
of $I$ items; $\alpha(a)$ for the effect of the $a$th of $A$
administrations; $\sigma(s)$ for the effect of the $s$th of $S$
subjects; and all two-factor interactions and the one three-factor
interaction, defined by their indices. The expected value of the
normally distributed variables with $IAS$ degrees of freedom is

$$(1)\quad \mathcal{E}\,y(ias) = \mu + \iota(i) + \alpha(a) + \sigma(s) + \phi(ia) + \theta(is) + \psi(as) + \epsilon(ias)$$

where $i = 1, \ldots, I$; $a = 1, \ldots, A$; $s = 1, \ldots, S$.

In order to obtain the analysis of variance of (1) we will
re-parameterize the model by imposing certain conditions on the
parameters of the main and interaction effects. We assume that the
summation of any effect is zero if the summation runs over the
parscripts which correspond to the index or indices of the
parameter. For example, we assume that $\sum_i \iota(i) = 0$ and
$\sum_i \phi(ia) = 0$. We shall also assume that the variance of a
response, $y(ias)$, is $\sigma^2$. Table I indicates this fact as
well as providing the covariances for various combinations of $i$,
$a$ and $s$. The primes on the indices of that and other tables
denote different values of the indices.


TABLE I

VARIANCE AND COVARIANCE TABLE FOR THE y(ias)

                         s = s'                              s ≠ s'
             a = a'                a ≠ a'

i = i'       $\sigma^2$            $\rho(A)\sigma^2$          0
i ≠ i'       $\rho(I)\sigma^2$     $\rho(I)\rho(A)\sigma^2$   0


From Table I we note that the responses of the subjects to the same
items in different administrations are intra-correlated to the
extent of $\rho(A)$ and we call this number the coefficient of
external consistency. Also the responses to different items in the
same administration are intra-correlated to the degree of $\rho(I)$
and we call this number the coefficient of internal consistency. A
function of $\rho(I)$ was denoted $\rho(H)$ in (1) and called the
index of internal consistency. Also in Table I we note that the
intra-correlation of different items in different administrations
is assumed to be $\rho(I)\rho(A)$. The responses of different
subjects are, of course, taken to be stochastically independent
under any conditions.

Due to the covariances specified in Table I, the analysis of
variance is not as simple as usual. The development, given in some
detail here, is made along the lines shown in (1) and originally
inspired by Nandi (2). We first make an orthogonal transformation
on the $y(ias)$ in order to obtain a set of transformed variables
which are independent for different items both within and between
administrations. The equations of this transformation are:

$$(2)\quad z(has) = \Big[\,h\,y(\overline{h+1}\,as) - \sum_{i=1}^{h} y(ias)\Big] \Big/ \sqrt{h(h+1)}, \qquad h = 1, \ldots, I-1;$$

$$z(Ias) = \sum_{i=1}^{I} y(ias) \Big/ \sqrt{I} = \sqrt{I}\,y(.as).$$


TABLE II

VARIANCE AND COVARIANCE TABLE FOR THE z(has) FOR s = s'

             h = h' < I                           h = h' = I

a = a'       $\sigma^2[1 - \rho(I)]$              $\sigma^2[1 + \overline{I-1}\,\rho(I)]$
a ≠ a'       $\rho(A)\sigma^2[1 - \rho(I)]$       $\rho(A)\sigma^2[1 + \overline{I-1}\,\rho(I)]$
h ≠ h'       0                                    0


TABLE III

VARIANCE AND COVARIANCE TABLE FOR THE w(Ibs)

             b < A                                                      b = A

b = b'       $\sigma^2[1 + \overline{I-1}\,\rho(I)][1 - \rho(A)]$       $\sigma^2[1 + \overline{I-1}\,\rho(I)][1 + \overline{A-1}\,\rho(A)]$
b ≠ b'       0                                                          0


*The opinions expressed are solely those of the author and are in
no way official; nor are they to be construed as representing those
of the U.S. Naval Personnel Research Field Activity or Bureau of
Naval Personnel.



Table II gives the variances and covariances of these transformed
variables for various combinations of h and a for the same subject.
It is not necessary to indicate the conditions for s ≠ s' since, from
Table I, we assume that complete independence exists between the
responses of different subjects. Table II shows some interesting
things. Among them is the fact that the transformation achieved the
desired effect of obtaining variables which are independent from item
to item within and between administrations and that these variables
are intra-correlated to the degree of ρ(A) for the same items in
different administrations. Therefore, we will have to transform both
the z(has) and z(Ias) sets again. This time, however, we will attempt
to gain independence between administrations.
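The effect of the transformation in (2) can be checked numerically. The sketch below is an illustration only: the function names and the values of I, σ² and ρ(I) are assumptions chosen for the example, not quantities from the study. It builds the transformation matrix for one subject in one administration, confirms that it is orthogonal, and confirms that it turns an intra-class covariance matrix into a diagonal one with entries σ²[1 − ρ(I)] and σ²[1 + (I−1)ρ(I)].

```python
import math


def helmert_rows(n):
    """Rows 1..n-1 follow the contrast form of (2); row n is the scaled mean."""
    rows = []
    for h in range(1, n):
        scale = math.sqrt(h * (h + 1))
        # -1 on the first h variables, +h on variable h+1, 0 on the rest
        rows.append([-1.0 / scale] * h + [h / scale] + [0.0] * (n - h - 1))
    rows.append([1.0 / math.sqrt(n)] * n)
    return rows


def matmul(a, b):
    """Plain-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def intraclass_cov(n, sigma2, rho):
    """Variance sigma2 on the diagonal, covariance rho*sigma2 elsewhere."""
    return [[sigma2 if i == j else rho * sigma2 for j in range(n)] for i in range(n)]


# Hypothetical values, chosen only for illustration.
N_ITEMS, SIGMA2, RHO_I = 4, 2.0, 0.3

H = helmert_rows(N_ITEMS)
V = intraclass_cov(N_ITEMS, SIGMA2, RHO_I)
HHt = matmul(H, transpose(H))                # identity if H is orthogonal
cov_z = matmul(matmul(H, V), transpose(H))   # covariance of the z variables

for row in cov_z:
    print([round(v, 6) for v in row])
```

The first I − 1 diagonal entries correspond to the z(has) with h < I and the last to z(Ias), which is the pattern shown for a = a' in Table II.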

The form of the transformation given in (2) is first used on the
z(Ias) variables. We call these new variables w(Ibs), b = 1,...,A−1
and w(IAs). Their variances and covariances are given in Table III.
We see from Table III that we have succeeded in deriving two sets of
variables, the w(Ibs) which have (A−1)S degrees of freedom and the
w(IAs) which have S degrees of freedom. These two sets are
independent between administrations and between subjects.

Because of an invariance property of the orthogonal linear
transformation given in (2), we know that

(3) $$ \sum_{s}^{S}\sum_{a}^{A}\sum_{i}^{I}\bigl[y(ias)-\mathcal{E}\,y(ias)\bigr]^{2} = \sum_{s}^{S}\sum_{a}^{A}\sum_{i}^{I}\bigl[z(ias)-\mathcal{E}\,z(ias)\bigr]^{2}. $$


Also, because of the transformation, the right side of (3) can be
broken into the two stochastically independent parts,

(4) $$ \sum_{s}^{S}\sum_{a}^{A}\sum_{h}^{I-1}\bigl[z(has)-\mathcal{E}\,z(has)\bigr]^{2} + \sum_{s}^{S}\sum_{a}^{A}\bigl[z(Ias)-\mathcal{E}\,z(Ias)\bigr]^{2}. $$


With the transformation we also decompose the right side of (4) into

(5) $$ \sum_{s}^{S}\sum_{b}^{A-1}\bigl[w(Ibs)-\mathcal{E}\,w(Ibs)\bigr]^{2} + \sum_{s}^{S}\bigl[w(IAs)-\mathcal{E}\,w(IAs)\bigr]^{2}. $$


The z(Ias) = √I y(.as) and the expected value of the z(Ias) is
μ + α(a) + σ(s) + ψ(as). Using this information and (3) and (4), we
learn immediately that the left side of (4) can be expressed as

(6) $$ \sum_{s}^{S}\sum_{a}^{A}\sum_{i}^{I}\bigl[y(ias)-y(.as)-\iota(i)-\varphi(ia)-\theta(is)-\varepsilon(ias)\bigr]^{2}. $$


Recall that these variables are intra-correlated to the degree ρ(A)
between administrations. Using the fact that w(IAs) = √(IA) y(..s),
that 𝓔 w(IAs) = μ + σ(s), and the right side of (4) expressed as a
function of y(.as), we learn that (5) is equivalent to

(7) $$ I\sum_{s}^{S}\sum_{a}^{A}\bigl[y(.as)-y(..s)-\alpha(a)-\psi(as)\bigr]^{2} + IA\sum_{s}^{S}\bigl[y(..s)-\mu-\sigma(s)\bigr]^{2}. $$

The sums of squares of (7) are respectively associated with (A−1)S
and S degrees of freedom and have variances equal to
σ²[1 + (I−1)ρ(I)][1 − ρ(A)] and σ²[1 + (I−1)ρ(I)][1 + (A−1)ρ(A)].

Returning to (6) we recall from Table II that those variables were
not independent between administrations. The sums of squares of (6)
can be split into two independent parts by using the form of the
transformation given by (2) again. This time we let
x(ias) = y(ias) − y(.as) and transform the x(ias) to t(ibs) and
t(iAs). Because t(iAs) = √A x(i.s) = √A[y(i.s) − y(..s)] and
𝓔 t(iAs) = 𝓔[y(i.s) − y(..s)] = ι(i) + θ(is), (6) is identically
equivalent to

(8) $$ \sum_{s}^{S}\sum_{a}^{A}\sum_{i}^{I}\bigl[y(ias)-y(.as)-y(i.s)+y(..s)-\varphi(ia)-\varepsilon(ias)\bigr]^{2} + A\sum_{s}^{S}\sum_{i}^{I}\bigl[y(i.s)-y(..s)-\iota(i)-\theta(is)\bigr]^{2} $$

whose terms are associated with (I−1)(A−1)S and (I−1)S degrees of
freedom and have variances equal to σ²[1 − ρ(I)][1 − ρ(A)] and
σ²[1 − ρ(I)][1 + (A−1)ρ(A)].
Thus the left side of (3) has been partitioned into the four parts
represented in the sums of squares given in (8) and (7). Each of the
partitions is independent of the others and the variables of each
have different variances. Moreover, the variables within the sets are
independent. Consequently we can make an analysis of variance on each
of the four sets by the usual methods. These four analyses are
assembled in Table IV which shows various sources of variation,
degrees of freedom, their associated sums of squares which are
expressed in a form most suitable for calculation, and the expected
mean squares of the ‘‘error’’ terms of each of the four analyses. The
hypotheses within the partitions of Table IV are to be tested by
forming an F test by making a ratio of the mean square for the
hypothesis to its associated mean square for error. The four error
mean squares could be used in various ways to test hypotheses
regarding
TABLE IV

AN ANALYSIS OF VARIANCE TABLE FOR THE y(ias)

Source of            Degrees             Sum                        Expected
Variation            of Freedom          of Squares                 Mean Squares

H[IA]; φ(ia) = 0     (I−1)(A−1)          D − A − B + G
E[IAS]               (I−1)(A−1)(S−1)     T − D − E − F + A + B + C − G    σ²[1 − ρ(I)][1 − ρ(A)]

H[I]; ι(i) = 0       I−1                 A − G
E[IS]; θ(is) = 0     (I−1)(S−1)          E − A − C + G              σ²[1 − ρ(I)][1 + (A−1)ρ(A)]

H[A]; α(a) = 0       A−1                 B − G
E[AS]; ψ(as) = 0     (A−1)(S−1)          F − B − C + G              σ²[1 + (I−1)ρ(I)][1 − ρ(A)]

H[M]; μ = 0          1                   G
E[S]; σ(s) = 0       S−1                 C − G                      σ²[1 + (I−1)ρ(I)][1 + (A−1)ρ(A)]

Total                IAS                 T

A = Σ_i [Σ_a Σ_s y(ias)]² / AS          E = Σ_i Σ_s [Σ_a y(ias)]² / A

B = Σ_a [Σ_i Σ_s y(ias)]² / IS          F = Σ_a Σ_s [Σ_i y(ias)]² / I

C = Σ_s [Σ_i Σ_a y(ias)]² / IA          G = [Σ_i Σ_a Σ_s y(ias)]² / IAS

D = Σ_i Σ_a [Σ_s y(ias)]² / S           T = Σ_i Σ_a Σ_s y²(ias)
ρ(A) or ρ(I). For instance, we could compare the mean square of E[S],
as obtained from Table IV, with the mean square of E[AS] to test the
hypothesis that ρ(A) = 0. Also we could compare the mean square of
E[IAS] and the mean square of E[AS] to test the hypothesis that
ρ(I) = 0.


The analysis given in Table IV may be derived in other ways than the
one given here. One of these involves the analysis of (1) as a mixed
model three-way classification with all effects of (1), except those
involving an ‘‘s’’ parscript, considered as fixed. Those others are
assumed to be random. This set-up involves expressing the ρ(A) and
ρ(I) in terms of components of variance. Another way of deriving
Table IV is by considering the ‘‘dyadic anova’’ illustrated by
Tukey (3). This amounts to using the A variables, classified by I and
S, as an A-dimensional vector. The basic framework of the analysis is
given by placing A = 1 in the terms of Table IV. However, instead of
sums of squares for the H[I], E[IS], H[M] and E[S] partitions, we
would display a matrix of sums of squares and cross-products for each
partition. The analysis of Table IV could be regained by orthogonally
rotating these matrices and appropriately designating the diagonal
terms of these rotated matrices. The ‘‘dyadic anova’’ has much to
recommend it, particularly because it utilizes more information from
the same data than does the analysis given in Table IV, although at
the expense of some multivariate normal assumptions. Also it enables
one to see at a glance whether or not the E[IS] are likely to be
correlated the same as the E[S] between administrations.


Some Interval and Point Estimations 





The problem of testing hypotheses on the ρ(A) and ρ(I) is probably
not as important as their estimation. Let us estimate ρ(A) first. We
define F[A] to be the mean square of E[S] divided by the mean square
of E[AS]. Then the 100(1 − α)% confidence interval for ρ(A) is given
by

(9) $$ \frac{F[A]-F[AL]}{F[A]+(A-1)\,F[AL]} \;>\; \rho(A) \;>\; \frac{F[A]-F[AU]}{F[A]+(A-1)\,F[AU]}, $$

where F[AU] is the upper 100(1 − α/2)% point of the F distribution
with S−1 and (A−1)(S−1) degrees of freedom. We abbreviate this as
F[AU] = F[S−1, (A−1)(S−1); 1−α/2]. Also
F[AL] = 1/F[(A−1)(S−1), S−1; 1−α/2]. A point estimate for ρ(A) is
given by placing F[AU] = 1 in the right side of (9). It should be
noted that when A = 2, the product-moment correlation coefficient of
the subjects' scores in each administration approximates the
empirical estimate of ρ(A). These two estimates would be equal if the
variances of the subject scores in each administration were equal.
Otherwise the absolute value of the estimated product-moment
correlation is greater than the absolute value of the estimate of
ρ(A), say r(A). This is true because if r(A) is expressed in the form
of a correlation coefficient, it will have the same numerator as the
product-moment correlation. However, the denominators will be
different. The denominator of r(A) is the arithmetic mean of the two
‘‘subject’’ variances whereas the denominator of the product-moment
correlation is the geometric mean of those variances. Since for the
same numbers, the arithmetic mean is at least as large as the
geometric mean, the absolute value of r(A) is less than or equal to
the absolute value of the product-moment correlation of the subject
scores.
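The comparison for A = 2 can be illustrated with a small computation. The following is a sketch under stated assumptions: each subject's score in an administration is treated as a single number, the two score lists are invented for illustration, and the function names are hypothetical. It forms the mean squares for subjects and for the administration-by-subject residual, takes r(A) = (F − 1)/(F + A − 1) with A = 2, and compares it with the product-moment correlation.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def sample_cov(xs, ys):
    """Covariance with divisor S - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def r_a_two_administrations(x1, x2):
    """Point estimate of rho(A) from the subject and residual mean squares, A = 2."""
    s = len(x1)
    grand = (sum(x1) + sum(x2)) / (2 * s)
    subj = [(a + b) / 2 for a, b in zip(x1, x2)]
    admin = [mean(x1), mean(x2)]
    ms_subjects = 2 * sum((m - grand) ** 2 for m in subj) / (s - 1)
    ss_resid = sum(
        (x - sm - am + grand) ** 2
        for xs, am in zip((x1, x2), admin)
        for x, sm in zip(xs, subj)
    )
    ms_resid = ss_resid / (s - 1)      # (A - 1)(S - 1) degrees of freedom with A = 2
    f_ratio = ms_subjects / ms_resid
    return (f_ratio - 1) / (f_ratio + 1)


# Hypothetical subject scores in two administrations.
first = [10, 12, 9, 15, 11, 14]
second = [11, 15, 8, 19, 10, 16]

v1, v2 = sample_cov(first, first), sample_cov(second, second)
c12 = sample_cov(first, second)
r_a = r_a_two_administrations(first, second)
r_pm = c12 / math.sqrt(v1 * v2)        # product-moment correlation
print(r_a, r_pm)
```

With unequal variances in the two administrations, as in these made-up scores, r(A) falls slightly below the product-moment correlation in absolute value.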

The 100(1 − α)% confidence interval of ρ(I) is found by letting F[I]
be the ratio of the mean square of E[AS] to the mean square of E[IAS]
and substituting into

(10) $$ \frac{F[I]-F[IL]}{F[I]+(I-1)\,F[IL]} \;>\; \rho(I) \;>\; \frac{F[I]-F[IU]}{F[I]+(I-1)\,F[IU]}. $$

In (10), F[IU] = F[(A−1)(S−1), (I−1)(A−1)(S−1); 1−α/2] and
F[IL] = 1/F[(I−1)(A−1)(S−1), (A−1)(S−1); 1−α/2]. The point estimate
for ρ(I) is found by setting F[IU] equal to 1 on the right side of
(10). There are other ways of getting point and interval estimates of
ρ(A) and ρ(I) from the entries in Table IV including ‘‘pooling’’
certain sums of squares and degrees of freedom. However, if the
degrees of freedom involved in the F's of (9) and (10) are large,
those formulas should be adequate. It is of interest to note that
necessarily ρ(A) > −1/(A−1) and ρ(I) > −1/(I−1).

Finally, let us consider the index of the internal consistency,
ρ(H). The 100(1 − α)% confidence interval for ρ(H) is given by

(11) $$ 1 - F[IL]/F[I] \;>\; \rho(H) \;>\; 1 - F[IU]/F[I]. $$

If the confidence interval of (11) includes zero, the hypothesis that
ρ(H) = 0 is accepted.
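The limits in (9), (10) and (11) all have the same algebraic form, so they can be sketched with one small helper. The sketch below is illustrative only: the function names are hypothetical, and the F ratio and the two critical values are made-up placeholders that would in practice come from Table IV and from an F table with the degrees of freedom stated above.

```python
def rho_limit(f_ratio, levels, f_crit):
    """One limit of (9) or (10); levels is A for rho(A) and I for rho(I)."""
    return (f_ratio - f_crit) / (f_ratio + (levels - 1) * f_crit)


def rho_point(f_ratio, levels):
    """Point estimate: the critical value is replaced by 1."""
    return rho_limit(f_ratio, levels, 1.0)


def rho_interval(f_ratio, levels, f_upper, f_lower):
    """(lower, upper) limits; f_upper is F[.U] and f_lower is F[.L]."""
    return rho_limit(f_ratio, levels, f_upper), rho_limit(f_ratio, levels, f_lower)


def index_interval(f_ratio, f_upper, f_lower):
    """(lower, upper) limits for rho(H) as in (11)."""
    return 1 - f_upper / f_ratio, 1 - f_lower / f_ratio


# Placeholder inputs: an observed ratio F[I] with I = 3 items and
# hypothetical critical values F[IU] and F[IL].
F_RATIO, LEVELS, F_UPPER, F_LOWER = 5.0, 3, 2.5, 0.4

point = rho_point(F_RATIO, LEVELS)
low, high = rho_interval(F_RATIO, LEVELS, F_UPPER, F_LOWER)
h_low, h_high = index_interval(F_RATIO, F_UPPER, F_LOWER)
print(point, (low, high), (h_low, h_high))
```

Because F[.L] is below 1 and F[.U] is above 1, the point estimate always lies inside the interval.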


Summary 


An experimental model was specified which hypothesized the parametric
form of a response by a subject to an item of an examination which
was administered to the same subjects several times. An analysis of
variance of this model was made whereby tests of certain hypotheses
associated with the parameters of the model could be made. Also we
have shown how to make point and interval estimates and test for the
existence of the coefficient and index of internal consistency as
well as the coefficient of external consistency.
SAMPLING ERROR DUE TO CHOICE OF SPLIT
IN SPLIT-HALF RELIABILITY COEFFICIENTS 


FREDERIC M. LORD 
Educational Testing Service 
Princeton, New Jersey 


Abstract 


The formula and derivation are given for the sampling variance of the
‘‘random-halves’’ reliability coefficient when all possible splits
are sampled for a given group of examinees. When the number of test
items is large, this sampling variance is found to be larger than
that of the Kuder-Richardson formula-20 reliability coefficient by a
multiplicative factor equal to the number of test items. The usual
matched-halves reliability coefficient will likewise ordinarily have
a similarly large sampling variance, since it likewise is based on an
arbitrary split of the test items. An adaptation of the
Jackson-Ferguson ‘‘battery reliability coefficient,’’ used as an
estimate of the matched-forms reliability of the test, avoids this
relatively large sampling variance.


IN OBTAINING a split-half reliability coefficient, the items of a
test are first of all divided into two equally numerous sets
(half-tests) and the correlation between scores on the half-tests is
computed. It will be assumed in the present discussion of reliability
that all such coefficients are ‘‘stepped-up’’ by any one of the
usual, mathematically equivalent formulas that do not assume equal
score variance for the two half-tests. A fuller discussion of these
formulas is given by Cronbach [2]. One of these formulas appears as
equation 1, below.

Three kinds of split-half coefficients will be 
discussed in detail: 


1. A random-halves reliability coefficient is obtained when the
choice of items for the half-tests is made at random. It is an
estimate of the random-forms reliability of the test.

2. A matched-halves reliability coefficient is obtained whenever the
two half-tests are purposely selected to be more nearly alike than
random halves. It is an estimate of the matched-forms reliability of
the test. A matched-halves coefficient tends to be higher than a
random-halves coefficient, the amount of difference depending on the
effectiveness of the matching.

3. An odd-even reliability coefficient is obtained when items are
assigned to the two half-tests according to whether their serial
number in the test is odd or even. If the test items are arranged in
effectively random order in the test, the odd-even coefficient is a
random-halves coefficient; if the items are arranged roughly in order
of difficulty, or are grouped by type or by subject matter, it is a
matched-halves coefficient.


It has been shown by Cronbach [2] that the Kuder-Richardson
formula-20 reliability coefficient ($r_{20}$) is the average of all
split-half coefficients computed from all possible splits of a given
test. Two conclusions are immediately obvious:

1. The random-halves coefficient and $r_{20}$ are both estimates of
the same parameter (the random-forms reliability of the test).

2. The sampling variance due to sampling of items, ‘‘type-2
sampling’’ [4], is larger for the random-halves coefficient than for
$r_{20}$; hence, the former coefficient is a relatively inefficient
estimator. The degree of this inefficiency will be brought out in
what follows.
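Cronbach's result can be checked by brute force on a small test. The sketch below is an illustration, not part of the original derivation: the response matrix is invented, the function names are hypothetical, and variances are computed with divisor N throughout. It enumerates every split of a four-item test into halves, steps up each split by the formula 4 × (covariance of the half-test scores) / (variance of the total score), and compares the average with the formula-20 coefficient.

```python
from itertools import combinations


def pcov(xs, ys):
    """Covariance over examinees with divisor N."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def split_half(scores, half):
    """Stepped-up split-half coefficient for the items listed in `half`."""
    m = len(scores[0])
    other = [j for j in range(m) if j not in half]
    t = [sum(row[j] for j in half) for row in scores]
    t_prime = [sum(row[j] for j in other) for row in scores]
    total = [a + b for a, b in zip(t, t_prime)]
    return 4 * pcov(t, t_prime) / pcov(total, total)


def kr20(scores):
    """Kuder-Richardson formula 20 (item variances with divisor N)."""
    m = len(scores[0])
    total = [sum(row) for row in scores]
    item_vars = [pcov(col, col) for col in zip(*scores)]
    return m / (m - 1) * (1 - sum(item_vars) / pcov(total, total))


# Invented responses: 6 examinees (rows) by 4 items (columns).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

m = len(scores[0])
# Each split appears twice (a half and its complement); the mean is unaffected.
coefficients = [split_half(scores, h) for h in combinations(range(m), m // 2)]
mean_split_half = sum(coefficients) / len(coefficients)
print(sorted(set(round(c, 6) for c in coefficients)), mean_split_half, kr20(scores))
```

The individual split-half values scatter around their mean, which is the spread that the sampling variance discussed here describes.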


The sampling variance of the random-halves coefficient for type-2
sampling (which will be assumed hereafter unless the contrary is
specified) has not been worked out previously. The large-sample
formula is derived in the present article. It will be seen that when
the number of test items (m) is large, the sampling variance of the
random-halves coefficient is ordinarily of order 1/m² whereas the
sampling variance of $r_{20}$ is [4] of order 1/m³. Thus the relative
efficiency of the former coefficient is of order 1/m, obviously very
poor when m is large. This low efficiency is not surprising when it
is considered that the particular value obtained for the
random-halves reliability depends on an arbitrary choice among all
the ½ m!/[(½m)!]² splits of the total test into halves.

Now the type-1 sampling variance (representing fluctuations when
examinees rather than items are sampled) will be of the same order of
magnitude for both $r_{20}$ and the random-halves coefficient. Since
$r_{20}$ is more efficient in type-2 sampling, it is clear that it is
theoretically preferable to a random-halves coefficient.

In many cases the research worker will wish to estimate the
matched-forms rather than the random-forms reliability of the test.
The usual matched-halves coefficient is based on a less arbitrary
split of the test items than is the random-halves coefficient, but
the split is nevertheless still arbitrary. Even if the total test
contains no more than one other item matching any single given item,
the total number of pairs of matching half-tests that can be formed
is $2^{\frac{1}{2}m-1}$ and each pair in general leads to a different
and arbitrary estimate of the matched-forms reliability.

Some less arbitrary and more efficient method for estimating
matched-forms reliability would be desirable. Such a method is
briefly described at the end of the present article.


Derivation of the Standard Error of the 
Random-Halves Coefficient








The formula for the random-halves coefficient is derived below. Some
familiarity with the line of reasoning in [4] will be helpful, but
not essential, to the reader who wishes to follow the proof.

The stepped-up split-half reliability coefficient may be written
[cf. 2, Table I]:

$$ r_{tt} = \frac{4\,s_{tt'}}{\sigma_{Y}^{2}}. $$ (1)

Here $r_{tt}$ denotes the reliability of the total test composed of
2n items; Y is the score on the total test (the test score is assumed
to be the sum of the item scores, on which no restrictions are
imposed); t is the score on a half-test composed of n items selected
at random from the ‘‘population’’ of m = 2n items; t' is the score on
the other half-test, so that t + t' = Y; $s_{tt'}$ is a covariance
over examinees; and $\sigma_{Y}^{2}$ is a variance, also computed
over examinees. Greek letters are used for statistics pertaining to
the total test, since for the present its 2n items will be treated as
a fixed finite population within which all samples are drawn.

It will be more convenient to work with ‘‘proportion scores’’
z = t/n, z' = t'/n, and ζ = Y/2n = (z + z')/2. We have

$$ r_{tt} = \frac{s_{zz'}}{\sigma_{\zeta}^{2}}. $$ (2)

The denominator of (2) is constant for all splits of the total test,
so the problem is simply to find Var($s_{zz'}$), the sampling
variance of the numerator.

Now [cf. 4, p. 14], by the formula for the covariance of a sum,

$$ s_{zz'} = \frac{1}{n^{2}}\sum_{i}^{n}\sum_{i'}^{n} s_{ii'}, $$ (3)

where $s_{ii'}$ is the observable covariance between item i and item
i' (i = 1,...,n; i' = n+1, n+2,..., 2n). Similarly

$$ \operatorname{Var} s_{zz'} = \frac{1}{n^{4}}\sum_{i}^{n}\sum_{i'}^{n}\sum_{j}^{n}\sum_{j'}^{n}\operatorname{Cov}(s_{ii'}, s_{jj'}), $$ (4)

where Cov($s_{ii'}$, $s_{jj'}$) is the sampling covariance between
$s_{ii'}$ and $s_{jj'}$ over all possible sets of items such that
i ≠ i', j ≠ j'. Now Cov($s_{ii'}$, $s_{jj'}$) is an observable
quantity, since in principle it could actually be computed for any
given set of 2n items by direct application of the definition. Since
direct computation by this approach would usually be totally
unfeasible, the remaining problem is to rewrite (4) so as to make the
computation more feasible, and further, so as to expose the nature
and magnitude of this sampling variance.

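The sums in (4) may be grouped to obtain

$$ \operatorname{Var} s_{zz'} = \frac{1}{n^{4}}\Bigl[\,\sum_{i\neq j}^{n^{2}-n}\sum_{i'\neq j'}^{n^{2}-n}\operatorname{Cov}(s_{ii'}, s_{jj'}) + 2\sum_{i\neq j}^{n^{2}-n}\sum_{i'}^{n}\operatorname{Cov}(s_{ii'}, s_{ji'}) + \sum_{i}^{n}\sum_{i'}^{n}\operatorname{Var} s_{ii'}\Bigr], \qquad (i \neq i',\; j \neq j'). $$ (5)

The terms under any given set of summation signs in (5) are all the
same, no matter what the value of the subscripts. Consequently,

$$ \operatorname{Var} s_{zz'} = \frac{1}{n^{2}}\bigl[(n-1)^{2}\operatorname{Cov}_{\neq}(s_{GH}, s_{IJ}) + 2(n-1)\operatorname{Cov}_{\neq}(s_{IJ}, s_{IK}) + \operatorname{Var}_{\neq} s_{IJ}\bigr], $$ (6)

where G, H, I, J, K = 1, 2,..., 2n.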

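The inequality symbol (≠) attached to ‘‘Cov’’ (or, later, to E) is to
be taken as denoting that differing subscripts among those
immediately following are to be taken as unequal. (6) may be expanded
in terms of expectations to obtain

$$ \operatorname{Var} s_{zz'} = \frac{1}{n^{2}}\bigl[(n-1)^{2}E_{\neq}s_{GH}s_{IJ} - (n-1)^{2}(E_{\neq}s_{IJ})^{2} + 2(n-1)E_{\neq}s_{IJ}s_{IK} - 2(n-1)(E_{\neq}s_{IJ})^{2} + E_{\neq}s_{IJ}^{2} - (E_{\neq}s_{IJ})^{2}\bigr] $$

$$ = \frac{1}{n^{2}}\bigl[(n-1)^{2}E_{\neq}s_{GH}s_{IJ} + 2(n-1)E_{\neq}s_{IJ}s_{IK} + E_{\neq}s_{IJ}^{2} - n^{2}(E_{\neq}s_{IJ})^{2}\bigr]. $$ (7)

Now, [4]

$$ E_{\neq}s_{GH}s_{IJ} = \frac{1}{m^{[4]}}\sum_{G\neq H\neq I\neq J} s_{GH}s_{IJ}, $$ (8)

where m = 2n and $m^{[r]} = m(m-1)(m-2)\cdots(m-r+1)$. Since

$$ \sum_{G\neq H}\;\sum_{I\neq J} s_{GH}s_{IJ} = \Bigl(\sum_{I\neq J} s_{IJ}\Bigr)^{2}, $$

(8) can be written

$$ E_{\neq}s_{GH}s_{IJ} = \frac{1}{m^{[4]}}\Bigl[\Bigl(\sum_{I\neq J} s_{IJ}\Bigr)^{2} - 4\sum_{I\neq J\neq K} s_{IJ}s_{IK} - 2\sum_{I\neq J} s_{IJ}^{2}\Bigr]. $$ (9)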


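(7) rewritten in terms of summation rather than expectation signs
becomes

$$ \operatorname{Var} s_{zz'} = \frac{1}{n^{2}}\Bigl\{\frac{(n-1)^{2}}{m^{[4]}}\Bigl[\Bigl(\sum_{I\neq J} s_{IJ}\Bigr)^{2} - 4\sum_{I\neq J\neq K} s_{IJ}s_{IK} - 2\sum_{I\neq J} s_{IJ}^{2}\Bigr] + \frac{2(n-1)}{m^{[3]}}\sum_{I\neq J\neq K} s_{IJ}s_{IK} + \frac{1}{m^{[2]}}\sum_{I\neq J} s_{IJ}^{2} - \frac{n^{2}}{(m^{[2]})^{2}}\Bigl(\sum_{I\neq J} s_{IJ}\Bigr)^{2}\Bigr\}. $$ (10)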


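When terms are collected in (10) and 2n substituted for m where
convenient, it is found that

$$ \operatorname{Var} s_{zz'} = \frac{2(n-1)}{n^{2}\,m^{[4]}\,m^{[2]}}\Bigl[\,n\Bigl(\sum_{I\neq J} s_{IJ}\Bigr)^{2} - m^{[2]}\sum_{I\neq J\neq K} s_{IJ}s_{IK} + (n-2)\,m^{[2]}\sum_{I\neq J} s_{IJ}^{2}\Bigr]. $$ (11)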


In (11), all terms of order 1 and of order 1/n have cancelled out (a
sum of multiplicity r is of order $n^{r}$, at most), leaving an
expression of order 1/n². When all terms of order 1/n³ and smaller
are dropped, (11) becomes

$$ \operatorname{Var} s_{zz'} = \frac{2}{m^{6}}\Bigl[\Bigl(\sum_{I}\sum_{J} s_{IJ}\Bigr)^{2} - 4n\sum_{I}\sum_{J}\sum_{K} s_{IJ}s_{IK} + m^{2}\sum_{I}\sum_{J} s_{IJ}^{2}\Bigr]. $$ (12)

In (12), the inequality signs have been dropped since their effect is
of order 1/n³, at most.

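It will be convenient to express $\sum\sum s_{IJ}^{2}$ in terms of
the (observable) variance of $s_{IJ}$:

$$ \sigma^{2}(s_{IJ}) = \frac{1}{m^{2}}\sum_{I}\sum_{J} s_{IJ}^{2} - \Bigl(\frac{1}{m^{2}}\sum_{I}\sum_{J} s_{IJ}\Bigr)^{2}. $$ (13)

Substitution of (13) into (12) yields the result

$$ \operatorname{Var} s_{zz'} = \frac{2}{m^{6}}\Bigl[\,2\Bigl(\sum_{I}\sum_{J} s_{IJ}\Bigr)^{2} - 4n\sum_{I}\sum_{J}\sum_{K} s_{IJ}s_{IK} + m^{4}\sigma^{2}(s_{IJ})\Bigr]. $$ (14)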
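Now,

$$ \sum_{I}\sum_{J}\sum_{K} s_{IJ}s_{IK} = \sum_{I}\Bigl(\sum_{J} s_{IJ}\Bigr)\Bigl(\sum_{K} s_{IK}\Bigr). $$ (15)

By the formula for the covariance of a sum,

$$ \frac{1}{m}\sum_{K} s_{IK} = s_{I\zeta}. $$ (16)

From (15) and (16),

$$ \sum_{I}\sum_{J}\sum_{K} s_{IJ}s_{IK} = m^{2}\sum_{I} s_{I\zeta}^{2}. $$ (17)

The right side of (17) may be expressed in terms of the observable
variance of $s_{I\zeta}$ [cf. equation 13]:

$$ \sigma^{2}(s_{I\zeta}) = \frac{1}{m}\sum_{I} s_{I\zeta}^{2} - \Bigl(\frac{1}{m}\sum_{I} s_{I\zeta}\Bigr)^{2}. $$ (18)

Substitution of (17) and (18) into (14) yields

$$ \operatorname{Var} s_{zz'} = \frac{2}{m^{2}}\bigl[\sigma^{2}(s_{IJ}) - 2\sigma^{2}(s_{I\zeta})\bigr]. $$ (19)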

(19) seems to be about the simplest form for expressing
Var $s_{zz'}$. Although it involves roughly m²/2 interitem
covariances ($s_{IJ}$), this is considerably fewer than the roughly
m⁴/4 covariances required by (4).

The meaning of (19) is better understood with the help of a
well-known identity from analysis of variance:

$$ \sigma^{2}(s_{IJ}) = \frac{1}{m}\sum_{I}(s_{I.} - s_{..})^{2} + \frac{1}{m}\sum_{J}(s_{.J} - s_{..})^{2} + \frac{1}{m^{2}}\sum_{I}\sum_{J}(s_{IJ} - s_{I.} - s_{.J} + s_{..})^{2}, $$ (20)

where a dot indicates an average taken over the corresponding
subscript. Now $s_{I.} = s_{I\zeta}$ and $s_{.J} = s_{J\zeta}$, so
(20) becomes

$$ \sigma^{2}(s_{IJ}) = 2\sigma^{2}(s_{I\zeta}) + \sigma^{2}(\text{Interaction}), $$ (21)

where σ²(Interaction) is the last term of (20), being the interaction
variance obtained when an ordinary analysis of variance is carried
out on the interitem variance-covariance matrix.

Substitution of (21) in (19) gives

$$ \operatorname{Var} s_{zz'} = \frac{2}{m^{2}}\,\sigma^{2}(\text{Interaction}). $$ (22)

Finally, from (2) and (22),

$$ \operatorname{Var} r_{tt} = \frac{2}{m^{2}\sigma_{\zeta}^{4}}\,\sigma^{2}(\text{Interaction}). $$ (23)

($\sigma_{\zeta}^{2}$, it will be recalled, is readily computable,
being simply the variance of obtained scores on the total test.)

Equation (23) is the large-sample formula for the sampling variance
of the stepped-up random-halves reliability coefficient for the case
where the sampling is over all possible ways of splitting the test.
The interaction term in (23) will vanish if and only if the interitem
variance-covariance matrix has only a single common factor. Since the
covariances are computed from fourfold-point correlations (phi
coefficients) rather than from tetrachoric correlations, this matrix
cannot be expected to reduce to a single common factor except in the
limiting case where all items are of equal difficulty and all item
intercorrelations are equal.
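The quantities entering (21) and (23) can be computed directly from a response matrix. The sketch below is an illustration under stated assumptions: the responses are invented, the function names are hypothetical, covariances use divisor N, the full m × m matrix (diagonal included) is analyzed as in (12) onward, and the total-test variance is taken on the proportion-score scale. With only six items the large-sample formula is at best a rough guide, so the number printed is not a substitute for enumerating the splits.

```python
def pmean(xs):
    return sum(xs) / len(xs)


def pcov(xs, ys):
    """Covariance over examinees with divisor N."""
    mx, my = pmean(xs), pmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def item_cov_matrix(scores):
    """Interitem variance-covariance matrix (rows of `scores` are examinees)."""
    cols = list(zip(*scores))
    return [[pcov(a, b) for b in cols] for a in cols]


def anova_parts(s):
    """Total, row-mean and interaction variances of a square matrix, as in (20)."""
    m = len(s)
    row = [pmean(r) for r in s]
    grand = pmean(row)
    total = sum((s[i][j] - grand) ** 2 for i in range(m) for j in range(m)) / m ** 2
    rows = sum((r - grand) ** 2 for r in row) / m
    inter = sum(
        (s[i][j] - row[i] - row[j] + grand) ** 2 for i in range(m) for j in range(m)
    ) / m ** 2
    return total, rows, inter


def var_random_halves(scores):
    """Large-sample variance of the random-halves coefficient, following (23)."""
    s = item_cov_matrix(scores)
    m = len(s)
    zeta = [pmean(r) for r in scores]      # proportion score on the total test
    var_zeta = pcov(zeta, zeta)
    _, _, inter = anova_parts(s)
    return 2 * inter / (m ** 2 * var_zeta ** 2)


# Invented responses: 8 examinees (rows) by 6 items (columns).
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

S = item_cov_matrix(scores)
total_var, row_var, inter_var = anova_parts(S)
print(total_var, 2 * row_var + inter_var, var_random_halves(scores))
```

The first two printed numbers agree, which is identity (21) for a symmetric matrix.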

If the m items in the test are themselves to be considered as a
random sample drawn from an infinite pool of items, this brings in a
new source of sampling fluctuation in addition to that arising from
the choice of splits. Since the formula-20 reliability coefficient is
the mean of the stepped-up split-half reliabilities obtained from all
possible splits, it follows that the error variance of the
random-halves reliability coefficient due to both kinds of sampling
is simply the sum of the error variance in equation (23) and the
type-2 error variance of $r_{20}$ [4, eq. 43].


A Relatively Efficient Coefficient for Estimating
Matched-Forms Reliability


Large-sample formulas have been derived for the type-2 sampling
variance of the random-halves reliability coefficient. This sampling
variance is seen to be of order 1/m², where m is the number of items
in the whole test. This is in contrast to the sampling variance of
the Kuder-Richardson formula-20 reliability coefficient ($r_{20}$),
which is of order 1/m³. Since $r_{20}$ is the mean of all split-half
coefficients for a given total test, $r_{20}$ should therefore always
be theoretically preferred to a split-half coefficient whenever the
halves would effectively be random halves.

For fixed m, the size of the sampling variance of the random-halves
coefficient depends on the degree to which the interitem
variance-covariance matrix approximates unit rank. The poorer this
approximation, the more inefficient is the random-halves coefficient
in comparison to the Kuder-Richardson formula-20 coefficient. At the
same time, however, the poorer the approximation, the less
homogeneous the test, and hence, in most cases, the stronger the
reason for estimating the matched-forms reliability rather than
computing a random-halves or an $r_{20}$ coefficient.

The large-sample sampling variance of a matched-halves coefficient
will be less than that of the random-halves coefficient because of
the matching; however, since the split of the items into matching
halves is arbitrary, the sampling variance of the matched-halves
coefficient will still be an order of magnitude larger than that of
$r_{20}$. The exact sampling variance of the matched-halves
coefficient has not been worked out.

The researcher who wishes to estimate the matched-forms reliability
of a test, but who is anxious to keep the sampling errors in his
estimate to a minimum should use the formula for a ‘‘battery
reliability coefficient’’ described by Jackson and Ferguson
[3, Chap. 6] instead of the usual matched-halves coefficient. Use of
this formula for determining the internal consistency of a
heterogeneous test has already been recommended elsewhere
[e.g., 1, D6.3]. The procedure is briefly as follows:


     Split the items into a number of subtests homogeneous on the
     characteristics used for matching (item difficulty may
     appropriately be used as one such characteristic). For each
     subtest separately, obtain the examinees’ scores, compute their
     variance, and obtain $r_{20}$. Also compute the variance of
     scores on the total test. The total test reliability is now
     computed by applying Jackson and Ferguson’s formula for
     ‘‘battery reliability,’’ which is simply the formula for the
     reliability of a sum of scores (subtests) in terms of the
     reliabilities of the scores (subtests) summed.
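The procedure can be sketched in a few lines. One assumption should be stated plainly: the formula for the reliability of a sum is not reproduced in this article, so the sketch uses the standard composite form, one minus the sum over subtests of (subtest variance) × (1 − subtest reliability), divided by the total-score variance, on the supposition that this is the formula intended. The responses, the grouping of items into subtests and the function names are invented for illustration.

```python
def pvar(xs):
    """Variance over examinees with divisor N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def kr20(scores):
    """Kuder-Richardson formula 20 for the items in `scores`."""
    m = len(scores[0])
    total = [sum(row) for row in scores]
    item_vars = [pvar(col) for col in zip(*scores)]
    return m / (m - 1) * (1 - sum(item_vars) / pvar(total))


def battery_reliability(scores, subtests):
    """Reliability of the total score from the subtest reliabilities.

    `subtests` lists, for each subtest, the column indices of its items.
    Uses the standard reliability-of-a-sum form (an assumption; see lead-in).
    """
    total = [sum(row) for row in scores]
    error = 0.0
    for items in subtests:
        sub = [[row[j] for j in items] for row in scores]
        sub_total = [sum(r) for r in sub]
        error += pvar(sub_total) * (1 - kr20(sub))
    return 1 - error / pvar(total)


# Invented responses: 8 examinees (rows) by 6 items (columns).
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Hypothetical matching: the first three items form one subtest, the rest another.
two_subtests = battery_reliability(scores, [(0, 1, 2), (3, 4, 5)])
one_subtest = battery_reliability(scores, [tuple(range(6))])
print(two_subtests, one_subtest, kr20(scores))
```

When every item is placed in a single subtest the result reduces to the ordinary formula-20 coefficient, which gives a simple check on the arithmetic.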


The virtue of this procedure arises from the fact that it avoids an
arbitrary grouping of items into half tests, while at the same time
$r_{20}$ is computed only for subgroups (subtests) of items that are
sufficiently homogeneous so that $r_{20}$ is appropriate. The
coefficient obtained will be an estimate of the matched-forms
reliability and will thus tend to be larger than the usual $r_{20}$
for the whole test, assuming that the test items are sufficiently
heterogeneous for matching to be a useful procedure.

Since this coefficient eliminates the factor of arbitrariness
characteristic of the matched-halves coefficient, it will have a
correspondingly smaller sampling variance. The amount of
arbitrariness eliminated depends on the heterogeneity of the test
items, as reflected in the number of subtests necessary
satisfactorily to eliminate heterogeneity and achieve effective
matching of items within the subtest. Thus the smaller the number of
subtests needed, the more arbitrariness eliminated by the use of the
proposed coefficient, and hence the greater its efficiency as an
estimator, relative to the usual matched-halves coefficient.